SWISH::Prog::Aggregator::DBI - index DB records with Swish-e


SWISH-Prog documentation Contained in the SWISH-Prog distribution.

NAME

SWISH::Prog::Aggregator::DBI - index DB records with Swish-e

SYNOPSIS

    use SWISH::Prog::Aggregator::DBI;
    use Carp;

    my $aggregator = SWISH::Prog::Aggregator::DBI->new(
        db => [
            "DBI:mysql:database=movies;host=localhost;port=3306",
            'some_user', 'some_secret_pass',
            {
                RaiseError  => 1,
                HandleError => sub { confess(shift) },
            }
        ],
        schema => {
          'moviesIlike' => {
               title       => {type => 'char', bias => 1},
               synopsis    => {type => 'char', bias => 1},
               year        => {type => 'int',  bias => 1},
               director    => {type => 'char', bias => 1},
               producer    => {type => 'char', bias => 1},
               awards      => {type => 'char', bias => 1},
               date        => {type => 'date', bias => 1},
               swishdescription => { synopsis => 1, producer => 1 },
               swishtitle       => 'title',
          }
        },
        alias_columns   => 1,
        indexer         => SWISH::Prog::Indexer::Native->new,
    );

    $aggregator->crawl();




DESCRIPTION

SWISH::Prog::Aggregator::DBI is a SWISH::Prog::Aggregator subclass that indexes database records so that they can be searched as full text with Swish-e.

METHODS

Since SWISH::Prog::Aggregator::DBI inherits from SWISH::Prog::Aggregator, read that documentation first. Only overridden methods are documented here.

new( opts )

Create new aggregator object.

The following opts are required:

db => connect_info

connect_info is used to obtain a DBI database handle, so see the DBI docs for syntax. If connect_info is an array ref, it is dereferenced and its elements are passed to DBI's connect() method. If connect_info is already a DBI handle object, it is used as is. Any other value is passed to connect() unchanged.
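The three accepted forms of connect_info can be illustrated without a live database. The following is a minimal sketch, not the module's code: connect_args() is a hypothetical helper that mirrors the dispatch described above and returns the argument list that would be handed to DBI->connect(), or an empty list when a handle is supplied.

```perl
use strict;
use warnings;
use Scalar::Util qw( blessed );

# Hypothetical helper (not part of SWISH::Prog): decide what would be
# passed to DBI->connect() for a given connect_info value.
sub connect_args {
    my ($info) = @_;
    die "db required\n" unless defined $info;

    # array ref: dereference and pass every element to connect()
    return @$info if ref($info) eq 'ARRAY';

    # existing database handle: used as is, no connect() call needed
    return () if blessed($info) && $info->isa('DBI::db');

    # anything else (e.g. a plain DSN string) is passed through unchanged
    return ($info);
}

my @from_array = connect_args(
    [ 'DBI:mysql:database=movies', 'some_user', 'some_secret_pass' ] );
my @from_dsn = connect_args('DBI:mysql:database=movies');

print scalar(@from_array), " args from array ref\n";
print scalar(@from_dsn),   " arg from DSN string\n";
```

The array ref form is usually the most convenient, because it lets you pass a username, a password, and an attribute hashref such as RaiseError in one value.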

schema => db_schema

db_schema is a hashref of table names and column descriptions. Each key should be a table name. Each value should be a hashref of column descriptions, where the key is the column name and the value is a hashref of type and bias. If type is omitted it defaults to 'char', and if bias is omitted it defaults to 1. See the SYNOPSIS.

There are two special column names: swishtitle and swishdescription. These are reserved for mapping real column names to Swish-e property names for returning in search results. swishtitle should be the name of a column, and swishdescription should be a hashref of column names to include in the StoreDescription value.
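To make the role of the two reserved names concrete, here is a small sketch (an illustration, not the module's implementation) that separates swishtitle and swishdescription from the real columns of one table and builds the kind of SELECT statement crawl() issues. split_table() is a hypothetical helper, and the backtick quoting follows the MySQL-style quoting used in the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: split one table description into real columns
# and the two reserved mapping entries.
sub split_table {
    my ($table_info) = @_;
    my %special = map { $_ => 1 } qw( swishtitle swishdescription );

    my @cols  = grep { !$special{$_} } sort keys %$table_info;
    my $title = $table_info->{swishtitle}       || '';
    my $desc  = $table_info->{swishdescription} || {};

    return ( \@cols, $title, $desc );
}

my %schema = (
    moviesIlike => {
        title            => { type => 'char', bias => 1 },
        synopsis         => { type => 'char', bias => 1 },
        year             => { type => 'int',  bias => 1 },
        swishtitle       => 'title',
        swishdescription => { synopsis => 1 },
    },
);

my ( $cols, $title, $desc ) = split_table( $schema{moviesIlike} );
my $sql = 'SELECT `' . join( '`,`', @$cols ) . '` FROM moviesIlike';

print "$sql\n";
print "title column: $title\n";
print "description columns: ", join( ', ', sort keys %$desc ), "\n";
```

Only the real columns end up in the SELECT list; the reserved names only decide which column values are reused as the title and description properties.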

indexer => indexer_obj

A SWISH::Prog::Indexer-derived object.

The following opts are optional:

alias_columns => 0|1

The alias_columns flag indicates whether all columns should be searchable under the default MetaName of swishdefault. The default is 1 (true). This is not the default behaviour of swish-e; this is a feature of SWISH::Prog.
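In the module's source, aliasing works by mapping the top-level XML tag of every table (named _<table>_row) to swishdefault through the MetaNameAlias config directive. The following is a minimal sketch of how that directive value is assembled; the second table name, actors, is a made-up example.

```perl
use strict;
use warnings;

my %schema = (
    moviesIlike => {},
    actors      => {},    # hypothetical second table
);

# one alias per table, in sorted order, all pointing at swishdefault
my $alias = 'swishdefault '
    . join( ' ', map { '_' . $_ . '_row' } sort keys %schema );

print "MetaNameAlias $alias\n";
```

Because every row document is wrapped in its table's _<table>_row tag, a query against swishdefault then matches text from any column of any table.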

NOTE: The new() method simply inherits from SWISH::Prog::Aggregator, so any params valid for that method are allowed here.

init

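See SWISH::Prog::Class. This method does all the setup: it connects to the database if a handle was not supplied, validates the schema, and configures the MetaNames, the swishdefault alias, and StoreDescription.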
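One part of that setup, according to the module's source, is ranking: when no MetaNames have been configured, column names are grouped by their bias value and one MetaNamesRank entry is registered per distinct bias. A minimal sketch of the grouping follows. The bias values are invented for illustration, and the numeric sort of the output lines is an addition made here for readability.

```perl
use strict;
use warnings;

my %columns = (
    title    => { type => 'char', bias => 5 },
    synopsis => { type => 'char', bias => 1 },
    year     => { type => 'int',  bias => 1 },
);

# group column names by their bias value
my %ranks;
push @{ $ranks{ $columns{$_}{bias} } }, $_ for sort keys %columns;

# one "bias name name ..." line per distinct bias
my @rank_lines
    = map { "$_ " . join( ' ', @{ $ranks{$_} } ) }
    sort { $a <=> $b } keys %ranks;

print "MetaNamesRank $_\n" for @rank_lines;
```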

crawl

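Creates the index by selecting every row from each table in the schema and passing each row to the indexer as a document.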

Returns number of rows indexed.
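Each row fetched by crawl() is turned into a small XML document before it is handed to the indexer. The sketch below approximates that layout for one row. It is a simplification: escape_xml() is a hypothetical stand-in for the Search::Tools::XML escaping the module uses, row_to_xml() is an illustrative name, and the row data is invented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's XML escaping.
sub escape_xml {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the per-row document: a _<table>_row wrapper, the table name,
# the title, then one element per column inside <_body>.
sub row_to_xml {
    my ( $table, $row, $title, $desc ) = @_;

    my $xml = "<_${table}_row><table>$table</table>"
        . '<swishtitle>' . escape_xml($title) . '</swishtitle><_body>';

    for my $col ( sort keys %$row ) {
        my $el = "<$col>" . escape_xml( $row->{$col} ) . "</$col>";

        # columns listed in swishdescription are wrapped in <_desc>
        $el = "<_desc>$el</_desc>" if $desc->{$col};
        $xml .= $el;
    }
    return $xml . "</_body></_${table}_row>";
}

my $row = { title => 'Brazil', synopsis => 'Paperwork & dreams' };
my $xml = row_to_xml( 'movies', $row, $row->{title}, { synopsis => 1 } );
print "$xml\n";
```

The <_desc> wrapper is what the StoreDescription setting picks up, so only the columns named in swishdescription are stored for excerpts.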

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog




You can also look for information at:

* Mailing list

http://lists.swish-e.org/listinfo/users

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=SWISH-Prog

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/SWISH-Prog

* CPAN Ratings

http://cpanratings.perl.org/d/SWISH-Prog

* Search CPAN

http://search.cpan.org/dist/SWISH-Prog/

COPYRIGHT AND LICENSE

SEE ALSO

http://swish-e.org/


package SWISH::Prog::Aggregator::DBI;

use strict;
use warnings;
use base qw( SWISH::Prog::Aggregator );
use Carp;
use Data::Dump qw( dump );
use DBI;
use SWISH::Prog::Utils;

__PACKAGE__->mk_accessors(qw( db alias_columns schema ));

our $VERSION = '0.51';

my $XMLer = Search::Tools::XML->new();    # included in Utils

sub init {
    my $self = shift;
    $self->SUPER::init(@_);

    # verify DBI connection
    if ( defined( $self->db ) ) {

        if ( ref( $self->db ) eq 'ARRAY' ) {
            $self->db( DBI->connect( @{ $self->{db} } ) );
        }
        elsif ( ref( $self->db ) && $self->db->isa('DBI::db') ) {

            # do nothing
        }
        else {
            $self->db( DBI->connect( $self->db ) );
        }
    }
    else {
        croak "db required";
    }

    # verify schema
    if ( defined $self->schema ) {

        my $schema = $self->schema;
        unless ( ref($schema) eq 'HASH' ) {
            croak "schema must be a hashref";
        }
        for my $table ( keys %$schema ) {
            my $cols = $schema->{$table};
            unless ( ref($cols) eq 'HASH' ) {
                croak "column descriptions must be a hashref";
            }
            for my $colname ( keys %$cols ) {

                # swishtitle and swishdescription are column mappings,
                # not column descriptions, so they are not validated here
                next if $colname eq 'swishtitle';
                next if $colname eq 'swishdescription';

                my $desc = $cols->{$colname};
                unless ( ref($desc) eq 'HASH' ) {
                    croak "$colname description must be a hashref";
                }
                $desc->{type}
                    ||= 'char'; # TODO auto-make property types based on this.
                $desc->{bias} ||= 1;
            }
        }
    }
    else {
        croak "schema required";
    }

    $self->{alias_columns} = 1 unless exists $self->{alias_columns};

    # unless metanames are defined, use all the column names from schema
    my $m = $self->config->MetaNames;
    unless (@$m) {
        for my $table ( keys %{ $self->{schema} } ) {
            my $columns = $self->{schema}->{$table};
            my %ranks;

            # skip the special mapping entries; they carry no bias
            push( @{ $ranks{ $columns->{$_}->{bias} } }, $_ )
                for grep { $_ ne 'swishtitle' && $_ ne 'swishdescription' }
                sort keys %$columns;

            for my $rank ( keys %ranks ) {
                $self->config->MetaNamesRank(
                    "$rank " . join( ' ', @{ $ranks{$rank} } ), 1 );
            }
        }
    }

    # alias the top level tags to that default search
    # will match any metaname in any table
    if ( $self->alias_columns ) {
        $self->config->MetaNameAlias(
            'swishdefault '
                . join( ' ',
                map { '_' . $_ . '_row' }
                sort keys %{ $self->{schema} } ),
            1    # always append
        );
    }

    # add 'table' metaname
    $self->config->MetaNames('table');

    # save all row text in the swishdescription property for excerpts
    $self->config->StoreDescription('XML* <_desc>');

}

sub crawl {
    my $self = shift;

    my @tables = sort keys %{ $self->{schema} };

T: for my $table (@tables) {

        my $table_info = $self->{schema}->{$table};

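        # special col names: remove them before building the column list
        # so that they are not selected as if they were real columns
        my $desc  = delete( $table_info->{swishdescription} ) || {};
        my $title = delete( $table_info->{swishtitle} )       || '';

        # which columns to index
        my @cols = sort keys %$table_info;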

        # TODO test other dbs besides mysql for quoting etc.
        my $c = $self->_do_table(
            name  => $table . ".index",
            sql   => "SELECT `" . join( '`,`', @cols ) . "` FROM $table",
            table => $table,
            desc  => $desc,
            title => $title,
        );
        $self->_increment_count($c);
    }

    return $self->{count};
}

sub _do_table {
    my $self = shift;
    my %opts = @_;

    if ( !$opts{sql} ) {
        croak "need SQL statement to index with";
    }

    $opts{table} ||= '';
    $opts{title} ||= '';

    my $counter = 0;
    my $indexer = $self->indexer;

    my $sth = $self->db->prepare( $opts{sql} )
        or croak "DBI prepare() failed: " . $self->db->errstr;
    $sth->execute or croak "SELECT failed: " . $sth->errstr;

    while ( my $row = $sth->fetchrow_hashref ) {

        my $title = $row->{ $opts{title} } || '[ no title ]';

        my $xml = $self->_row2xml( $XMLer->tag_safe( $opts{table} ),
            $row, $title, \%opts );

        my $doc = $self->doc_class->new(
            content => $xml,
            url     => ++$counter,
            modtime => time(),
            parser  => 'XML*',
            type    => 'application/x-swish-dbi',    # TODO ??
            data    => $row
        );

        $indexer->process($doc);
    }

    $sth->finish;

    return $counter;

}

sub _row2xml {
    my ( $self, $table, $row, $title, $opts ) = @_;

    my $xml
        = "<_${table}_row>"
        . "<table>"
        . $table
        . "</table>"
        . "<swishtitle>"
        . $XMLer->utf8_safe($title)
        . "</swishtitle>"
        . "<_body>";

    for my $col ( sort keys %$row ) {
        my @x = (
            $XMLer->start_tag($col),
            $XMLer->utf8_safe( $row->{$col} ),
            $XMLer->end_tag($col)
        );

        if ( $opts->{desc}->{$col} ) {
            unshift( @x, '<_desc>' );
            push( @x, '</_desc>' );
        }

        $xml .= join( '', @x );
    }
    $xml .= "</_body></_${table}_row>";

    #$self->debug and warn $xml . "\n";

    return $xml;
}

1;

__END__