SWISH::Prog::Aggregator - document aggregation base class


SWISH-Prog documentation Contained in the SWISH-Prog distribution.

NAME

SWISH::Prog::Aggregator - document aggregation base class

SYNOPSIS

 package MyAggregator;
 use strict;
 use base qw( SWISH::Prog::Aggregator );

 sub get_doc {
    my ($self, $url) = @_;

    # do something to create a SWISH::Prog::Doc object from $url

    return $doc;
 }

 sub crawl {
    my ($self, @where) = @_;

    foreach my $place (@where) {

       # do something to search $place for docs to pass to get_doc()

    }
 }

 1;

DESCRIPTION

SWISH::Prog::Aggregator is a base class that defines the basic API for writing an aggregator. A subclass must override two methods, get_doc() and crawl(); the base-class versions simply croak. See the SYNOPSIS for the prototypes.

See SWISH::Prog::Aggregator::FS and SWISH::Prog::Aggregator::Spider for examples of aggregators that crawl the filesystem and web, respectively.

METHODS

init

Sets object flags per the SWISH::Prog::Class API. The flags are also available as accessors, and include:

set_parser_from_type

If true, swish_filter() sets the parser() value of the doc_class() object based on its MIME type.

indexer

A SWISH::Prog::Indexer object. Required: init() croaks unless a SWISH::Prog::Indexer-derived object is supplied.

doc_class

The name of the SWISH::Prog::Doc-derived class to use in get_doc(). Default is SWISH::Prog::Doc.

swish_filter_obj

A SWISH::Filter object. If one is not passed to new(), one is created for you.

test_mode

Dry-run mode: prints information to stderr but does not build an index.

filter

Value should be a CODE ref. This is passed through to set_filter(); there is no filter mutator method.

ok_if_newer_than

Value should be a Unix timestamp (epoch seconds). Default is undef. If set, aggregators should skip files that have a modification time older than the timestamp.

You may get/set the ok_if_newer_than value with the ok_if_newer_than() attribute method, but use set_ok_if_newer_than() to include validation of the supplied timestamp value.

progress( Term::ProgressBar object )

Get/set a progress object. The default used in the examples/swish3 script is Term::ProgressBar. If set, its update() method is called with the running total as documents are counted, in step with count().
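The following is a minimal sketch of how a progress object can be driven without calling it for every document. It assumes only that the object has an update() method that returns the next count worth reporting, which is how Term::ProgressBar's update() behaves. Sketch::Progress and its step parameter are hypothetical stand-ins, not part of SWISH::Prog.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a progress bar. The only assumption is an
# update($so_far) method that returns the next count worth reporting.
package Sketch::Progress;

sub new {
    my ( $class, %arg ) = @_;
    return bless { step => $arg{step} || 10, calls => 0 }, $class;
}

sub update {
    my ( $self, $so_far ) = @_;
    $self->{calls}++;

    # ask to be called again once another 'step' docs have been seen
    return $so_far + $self->{step};
}

sub calls { return $_[0]->{calls} }

package main;

my $progress = Sketch::Progress->new( step => 10 );
my ( $so_far, $next ) = ( 0, 0 );

# Throttle: only call update() once the running total reaches the
# threshold the progress object last asked for.
for my $doc_number ( 1 .. 25 ) {
    $so_far++;
    if ( $so_far >= $next ) {
        $next = $progress->update($so_far);
    }
}

printf "update() called %d times for %d docs\n", $progress->calls, $so_far;
```

With a step of 10, update() is called at counts 1, 11 and 21, so the loop above reports 3 calls for 25 documents.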

config

Returns the SWISH::Prog::Config object from the Indexer being used. This is a read-only method (accessor not mutator).

count

Returns the total number of doc_class() objects returned by get_doc().

crawl( @where )

Override this method in your subclass. It does the aggregation, and passes each doc_class() object from get_doc() to indexer->process().

get_doc( url )

Override this method in your subclass. Should return a doc_class() object.

swish_filter( doc_class_object )

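Passes the content() of the SWISH::Prog::Doc object through SWISH::Filter and transforms it into something indexable. The doc_class_object is modified in place. Returns the doc_class_object, filtered. If filtering fails, a warning is issued and an empty value is returned.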

NOTE: All aggregators should call this method after get_doc() and before passing the doc to indexer->process().

See the SWISH::Filter documentation.
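The call order described above (get_doc(), then swish_filter(), then the indexer) can be sketched as follows. Every Sketch::* package here is a hypothetical stand-in, written so the example runs without SWISH::Prog installed; the stand-ins only model the sequence of calls and are not the real classes or their full APIs.

```perl
use strict;
use warnings;

# All Sketch::* packages are hypothetical stand-ins that model call order only.

package Sketch::Doc;
sub new { my ( $class, %arg ) = @_; return bless { %arg }, $class }
sub url { return $_[0]->{url} }

sub content {
    my $self = shift;
    $self->{content} = shift if @_;
    return $self->{content};
}

package Sketch::Indexer;
sub new { my $class = shift; return bless { seen => [] }, $class }
sub process { my ( $self, $doc ) = @_; push @{ $self->{seen} }, $doc->url; return }
sub seen { return @{ $_[0]->{seen} } }

package Sketch::Aggregator;
sub new { my ( $class, %arg ) = @_; return bless { count => 0, %arg }, $class }
sub count { return $_[0]->{count} }

# get_doc(): turn a "url" into a doc object (the content is faked here)
sub get_doc {
    my ( $self, $url ) = @_;
    return Sketch::Doc->new( url => $url, content => "  raw text of $url  " );
}

# stand-in for swish_filter(): normalize the content in place
sub swish_filter {
    my ( $self, $doc ) = @_;
    ( my $clean = $doc->content ) =~ s/^\s+|\s+$//g;
    $doc->content($clean);
    return $doc;
}

# crawl(): fetch each doc, filter it, then hand it to the indexer
sub crawl {
    my ( $self, @where ) = @_;
    for my $place (@where) {
        my $doc = $self->get_doc($place);
        $self->swish_filter($doc);
        $self->{indexer}->process($doc);
        $self->{count}++;
    }
    return $self->count;
}

package main;

my $indexer    = Sketch::Indexer->new;
my $aggregator = Sketch::Aggregator->new( indexer => $indexer );
my $total      = $aggregator->crawl(qw( a.txt b.txt ));

print "indexed $total docs: ", join( ', ', $indexer->seen ), "\n";
```

A real subclass would inherit new(), count() and swish_filter() from SWISH::Prog::Aggregator and supply only get_doc() and crawl().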

set_filter( code_ref )

Use code_ref as the doc_class filter. This method is called by init() if the filter param is set in the constructor. Croaks if code_ref is not a CODE ref.
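As a rough illustration of the technique, the sketch below installs a code ref as the filter() method of a class by assigning it to the class's symbol table entry. Sketch::Doc and install_filter() are hypothetical names used only for this example.

```perl
use strict;
use warnings;

# Hypothetical doc class with a do-nothing default filter() method.
package Sketch::Doc;
sub new { my ( $class, %arg ) = @_; return bless { %arg }, $class }

sub content {
    my $self = shift;
    $self->{content} = shift if @_;
    return $self->{content};
}
sub filter { return }    # default: leave the doc untouched

package main;

# Install a code ref as the filter() method of the named class.
sub install_filter {
    my ( $doc_class, $filter ) = @_;
    die "filter must be a CODE ref\n" unless ref($filter) eq 'CODE';
    no strict 'refs';
    no warnings 'redefine';
    *{ $doc_class . '::filter' } = $filter;
    return;
}

install_filter(
    'Sketch::Doc',
    sub {
        my $doc = shift;
        $doc->content( uc $doc->content );
    }
);

my $doc = Sketch::Doc->new( content => 'hello' );
$doc->filter;
print $doc->content, "\n";    # HELLO
```

Because the code ref replaces the method in the class itself, it applies to every object of that class, not only to docs created by one aggregator instance.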

set_ok_if_newer_than( timestamp )

Set the ok_if_newer_than attribute. timestamp should be a Unix epoch value. Croaks if timestamp contains any non-digit character.
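A minimal sketch of the validation and of the skip test an aggregator might apply is shown below. validate_timestamp() and is_new_enough() are hypothetical helper names, not methods of this class, and the temporary file exists only to give the example something to stat.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Accept only strings of digits; an undefined or empty value falls back to 0.
sub validate_timestamp {
    my $ts = shift || 0;
    die "timestamp should be an integer\n" if $ts =~ m/\D/;
    return $ts;
}

# Hypothetical helper: true if $path was modified at or after $ts.
sub is_new_enough {
    my ( $path, $ts ) = @_;
    return 1 unless $ts;    # no cutoff set, accept everything
    my $mtime = ( stat($path) )[9];    # element 9 of stat() is mtime
    return ( defined $mtime && $mtime >= $ts ) ? 1 : 0;
}

my ( $fh, $path ) = tempfile( UNLINK => 1 );
close $fh;

my $cutoff = validate_timestamp( time() - 3600 );    # one hour ago
print is_new_enough( $path, $cutoff ) ? "index\n" : "skip\n";

# Anything containing a non-digit is rejected.
my $ok = eval { validate_timestamp('2011-01-01'); 1 };
print "rejected: $@" unless $ok;
```

Note that a digits-only check also rejects negative and fractional values such as -1 or 1.5.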

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog

You can also look for information at:

* Mailing list

http://lists.swish-e.org/listinfo/users

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=SWISH-Prog

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/SWISH-Prog

* CPAN Ratings

http://cpanratings.perl.org/d/SWISH-Prog

* Search CPAN

http://search.cpan.org/dist/SWISH-Prog/

COPYRIGHT AND LICENSE

SEE ALSO

http://swish-e.org/


package SWISH::Prog::Aggregator;
use strict;
use warnings;
use base qw( SWISH::Prog::Class );
use Carp;
use SWISH::Prog::Utils;
use SWISH::Filter;
use SWISH::Prog::Doc;
use Scalar::Util qw( blessed );
use Data::Dump qw( dump );

our $VERSION = '0.51';

__PACKAGE__->mk_accessors(
    qw(
        set_parser_from_type
        indexer
        doc_class
        swish_filter_obj
        test_mode
        filter
        ok_if_newer_than
        progress
        )
);
__PACKAGE__->mk_ro_accessors(qw( count ));

sub init {
    my $self   = shift;
    my %arg    = @_;
    my $filter = delete $arg{filter};
    $self->SUPER::init(%arg);
    $self->{verbose} ||= 0;
    $self->{__progress_so_far} = 0;
    $self->{__progress_next}   = 0;

    if (   !$self->{indexer}
        or !blessed( $self->{indexer} )
        or !$self->{indexer}->isa('SWISH::Prog::Indexer') )
    {
        croak "SWISH::Prog::Indexer-derived object required to crawl()";
    }

    $self->{doc_class} ||= 'SWISH::Prog::Doc';
    $self->{swish_filter_obj} ||= SWISH::Filter->new;

    if ($filter) {
        $self->set_filter($filter);
    }

}

sub config {
    return shift->{indexer}->config;
}

sub crawl {
    my $self = shift;
    croak ref($self) . " does not implement crawl()";
}

sub get_doc {
    my $self = shift;
    croak ref($self) . " does not implement get_doc()";
}

sub swish_filter {
    my $self = shift;
    my $doc  = shift;
    unless ( $doc && blessed($doc) && $doc->isa('SWISH::Prog::Doc') ) {
        croak "SWISH::Prog::Doc-derived object required";
    }

    $doc->parser( $SWISH::Prog::Utils::ParserTypes{ $doc->type }
            || $SWISH::Prog::Utils::ParserTypes{default} )
        if $self->set_parser_from_type;

    if ( $self->{swish_filter_obj}->can_filter( $doc->type ) ) {
        my $content = $doc->content;
        my $url     = $doc->url;
        my $type    = $doc->type;
        my $f       = $self->{swish_filter_obj}->convert(
            document     => \$content,
            content_type => $type,
            name         => $url
        );

        if (   !$f
            || !$f->was_filtered
            || $f->is_binary )    # is is_binary necessary?
        {
            warn "skipping $url - filtering error\n";
            return;
        }

        if ( $self->debug > 1 ) {
            warn "$url [$type] was filtered\n";
            warn "content changed\n" if $doc->content ne ${ $f->fetch_doc };
        }

        $doc->content( ${ $f->fetch_doc } );

        # leave type and parser as-is
        # since we want to store original mime in indexer
        # TODO what about parser ?
        # since type will have changed ( $f->content_type ) from original
        # the parser type might also have changed?

        $doc->parser( $f->swish_parser_type ) if $self->set_parser_from_type;

    }
    else {

        if ( $self->debug ) {
            warn sprintf( "No filter applied to %s - cannot filter %s\n",
                $doc->url, $doc->type );
            warn sprintf( " available filter: %s\n", $_ )
                for $self->{swish_filter_obj}->filter_list;
        }

    }

    # return the (possibly filtered) doc, as the POD documents
    return $doc;
}

sub set_filter {
    my $self   = shift;
    my $filter = shift;
    unless ( ref($filter) eq 'CODE' ) {
        croak "filter must be a CODE ref";
    }

    # cheat a little by using this code instead of the default
    # method in doc_class
    {
        no strict 'refs';
        no warnings 'redefine';
        *{ $self->{doc_class} . '::filter' } = $filter;
    }

}

sub set_ok_if_newer_than {
    my $self = shift;
    my $ts = shift || 0;
    if ( $ts =~ m/\D/ ) {
        croak "timestamp should be an integer";
    }
    $self->ok_if_newer_than($ts);
}

#
# private method
#

sub _increment_count {
    my $self = shift;
    my $count = shift || 1;
    $self->{count} += $count;
    if ( $self->{progress} ) {
        $self->{__progress_so_far} += $count;
        if ( $self->{__progress_so_far} >= $self->{__progress_next} ) {
            $self->{__progress_next}
                = $self->{progress}->update( $self->{__progress_so_far} );
        }
    }
    return $self;
}

1;

__END__