SWISH::Prog::Aggregator::Object - index Perl objects with Swish-e


SWISH-Prog documentation Contained in the SWISH-Prog distribution.

Index


Code Index:

NAME

Top

SWISH::Prog::Aggregator::Object - index Perl objects with Swish-e

SYNOPSIS

Top

    my $aggregator = SWISH::Prog::Aggregator::Object->new(
        methods => [qw( foo bar something something_else )],
        class   => 'MyClass',
        title   => 'mytitle',
        url     => 'myurl',
        modtime => 'mylastmod'
        indexer => SWISH::Prog::Indexer::Native->new,
    );

    my $data = my_func_for_fetching_data();
    # $data is either iterator or arrayref of objects

    $aggregator->indexer->start;
    $aggregator->crawl( $data );
    $aggregator->indexer->finish;

DESCRIPTION

Top

SWISH::Prog::Aggregator::Object is designed for providing full-text search for your Perl objects with Swish-e.

Since SWISH::Prog::Aggregator::Object inherits from SWISH::Prog::Aggregator, read that documentation first. Any overridden methods are documented here.

If it seems odd at first to think of indexing objects, consider the advantages:

sorting

Particularly for scalar method values, time for sorting objects by method value is greatly decreased thanks to Swish-e's pre-sorted properties.

SWISH::API::Object integration

If you use SWISH::API::Object, you can get a Storable-like freeze/thaw effect with SWISH::Prog::Aggregator::Object.

caching

If some methods in your objects take a long while to calculate values, but don't change often, you can use Swish-e to cache those values, similar to the Cache::* modules, but in a portable, fast index.

METHODS

Top

new( opts )

Create new aggregator object.

opts may include:

methods

The methods param takes an array ref of method names. Each method name will be called on each object in crawl(). Each method name will also be stored as a PropertyName in the Swish-e index, unless you explicitly create a SWISH::Prog::Config object that that defines your PropertyNames.

class

The name of the class each object belongs to. The class value will be stored in the index itself for later use with SWISH::API::Object (or for your own amusement).

If not specified, the first object crawl()ed will be tested with the blessed() function from Scalar::Util.

title

Which method to use as the swishtitle value. Defaults to title.

url

Which method to use as the swishdocpath value. Defaults to url.

modtime

Which method to use as the swishlastmodified value. Defaults to Perl built-in time().

serial_format

Which format to use in serialize(). Default is json. You can also use yaml. If you don't like either of those, subclass SWISH::Prog::Aggregator::Object and override serialize() to provide your own format.

init

Initialize object. This overrides SWISH::Prog::Aggregator init() base method.

crawl( data )

Index your objects.

data should either be an array ref of objects, or an iterator object with a next method. If data is an iterator, it will be used like:

 while( my $object = $data->next )
 {
     $aggregator->method_to_index( $object );
 }

Returns number of objects indexed.

get_doc( object )

Returns a doc_class() instance representing object.

serialize( object, method_name )

Returns a serialized (stringified) version of the return value of method_name. If the return value is already a scalar string (i.e., if ref() returns false) then the return value is returned untouched. Otherwise, the return value is serialized with either JSON or YAML, depending on how you configured serial_format in new().

If you subclass SWISH::Prog::Aggregator::Object, then you can (of course) return whatever serialized format you prefer.

REQUIREMENTS

Top

SWISH::Prog, YAML::Syck, JSON::Syck

AUTHOR

Top

Peter Karman, <perl@peknet.com>

BUGS

Top

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

Top

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog




You can also look for information at:

* Mailing list

http://lists.swish-e.org/listinfo/users

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=SWISH-Prog

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/SWISH-Prog

* CPAN Ratings

http://cpanratings.perl.org/d/SWISH-Prog

* Search CPAN

http://search.cpan.org/dist/SWISH-Prog/

COPYRIGHT AND LICENSE

Top

SEE ALSO

Top

http://swish-e.org/


SWISH-Prog documentation Contained in the SWISH-Prog distribution.
package SWISH::Prog::Aggregator::Object;

use strict;
use warnings;
use base qw( SWISH::Prog::Aggregator );

use Carp;
use YAML::Syck ();
use JSON       ();
use SWISH::Prog::Utils;
use Scalar::Util qw( blessed );

__PACKAGE__->mk_accessors(
    qw( methods class title url modtime serial_format ));

our $VERSION = '0.51';

my $XMLer = Search::Tools::XML->new();    # included in Utils

sub init {
    my $self = shift;
    $self->SUPER::init(@_);

    $self->{title}         ||= 'title';
    $self->{url}           ||= 'url';
    $self->{modtime}       ||= 'modtime';
    $self->{serial_format} ||= 'json';

    unless ( $self->{methods} ) {
        croak "methods required";
    }

    # set up the config object
    my $config = $self->{indexer}->{config};

    ( my $class_meta = $self->class ) =~ s/\W/\./g;
    $self->{_class_meta} = $class_meta;

    # make urls find-able (really should adjust WordCharacters too...)
    $config->MaxWordLimit(256) unless $config->MaxWordLimit;

    # similar to DBI, we alias top-level tag
    # so all words are find-able via swishdefault
    $config->MetaNameAlias( 'swishdefault ' . $class_meta )
        unless $config->MetaNameAlias;
    $config->MetaNames( @{ $self->methods } ) unless @{ $config->MetaNames };

    $config->PropertyNames( @{ $self->methods } )
        unless @{ $config->PropertyNames };

    # IMPORTANT to do this because whitespace matters in YAML
    # NOTE that due to swish-e cache buffering, YAML fields
    # that are longer than 10k can get seriously messed up.
    # this is a swish-e bug that should be fixed.
    $config->PropertyNamesNoStripChars( @{ $self->methods } )
        unless @{ $config->PropertyNamesNoStripChars };

    $config->IndexDescription(
        join(
            ' ', 'class:' . $self->class, 'format:' . $self->serial_format
        )
    ) unless $config->IndexDescription;

}

sub crawl {
    my $self    = shift;
    my $data    = shift;
    my $indexer = $self->indexer;

    # IMPORTANT! that this not be undef since url defaults to it.
    $self->{count} = 0;

    if ( ref($data) eq 'ARRAY' ) {

        $self->{class} ||= blessed( $data->[0] );
        for my $o (@$data) {
            $indexer->process( $self->get_doc($o) );
            $self->_increment_count;
        }

    }
    elsif ( ref($data) && $data->can('next') ) {
        my $first = $data->next;
        $self->{class} ||= blessed($first);
        $indexer->process( $self->get_doc($first) );

        while ( my $o = $data->next ) {
            $indexer->process( $self->get_doc($o) );
            $self->_increment_count;
        }
    }
    else {
        croak "\$data $data doesn't look like it's in the expected format";
    }

    return $self->{count};
}

sub get_doc {
    my $self = shift;
    my $object = shift or croak "need object";

    my $titlemeth   = $self->{title};
    my $urlmeth     = $self->{url};
    my $modtimemeth = $self->{modtime};

    my $title
        = $object->can($titlemeth) ? $object->$titlemeth : '[ no title ]';

    my $url
        = $object->can($urlmeth)
        ? $object->$urlmeth
        : $self->{count};

    my $modtime
        = $object->can($modtimemeth)
        ? $object->$modtimemeth
        : time();

    my $xml = $self->_obj2xml( $self->{_class_meta}, $object, $title );

    my $doc = $self->doc_class->new(
        content => $xml,
        url     => $url,
        modtime => $modtime,
        parser  => 'XML*',
        type    => 'application/x-object',
        data    => $object
    );

    $self->debug and print $doc;

    return $doc;
}

sub _obj2xml {
    my ( $self, $class, $o, $title ) = @_;

    my $xml
        = $XMLer->start_tag($class)
        . "<swishtitle>"
        . $XMLer->utf8_safe($title)
        . "</swishtitle>";

    for my $m ( @{ $self->methods } ) {
        my $v = $self->serialize( $o, $m );

        my @x = (
            $XMLer->start_tag($m), $XMLer->utf8_safe($v), $XMLer->end_tag($m)
        );

        $xml .= join( '', @x );
    }
    $xml .= $XMLer->end_tag($class);

    $self->debug and print STDOUT $xml . "\n";

    return $xml;
}

sub serialize {
    my $self = shift;
    my ( $o, $m ) = @_;
    my $v = $o->$m;
    unless ( ref $v ) {
        return $v;
    }
    else {
        if ( $self->serial_format eq 'json' ) {
            return JSON->new->convert_blessed(1)->allow_blessed(1)
                ->encode($v);
        }
        elsif ( $self->serial_format eq 'yaml' ) {
            return YAML::Syck::Dump($v);
        }
        else {
            croak "unknown serial_format: " . $self->serial_format;
        }
    }

}

1;

__END__