SWISH::Prog::Aggregator::Spider - web aggregator


Contained in the SWISH-Prog distribution.

NAME

SWISH::Prog::Aggregator::Spider - web aggregator

SYNOPSIS

 use SWISH::Prog::Aggregator::Spider;
 my $spider = SWISH::Prog::Aggregator::Spider->new(
        indexer => SWISH::Prog::Indexer->new
 );

 $spider->indexer->start;
 $spider->crawl( 'http://swish-e.org/' );
 $spider->indexer->finish;

DESCRIPTION

SWISH::Prog::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, it uses WWW::Mechanize to do the hard work. See SWISH::Prog::Aggregator::Spider::UA.

METHODS

See SWISH::Prog::Aggregator

new( params )

All params have their own get/set methods too. They include:

use_md5

Flag indicating whether each URI's content should be fingerprinted and compared against content already seen. Useful if the same content is available under multiple URIs and you only want to index it once.

uri_cache

Get/set the SWISH::Prog::Cache-derived object used to track which URIs have been fetched already.

md5_cache

If use_md5() is true, this SWISH::Prog::Cache-derived object tracks the URI fingerprints.

queue

Get/set the SWISH::Prog::Queue-derived object for tracking which URIs still need to be fetched.

ua

Get/set the SWISH::Prog::Aggregator::Spider::UA object.

max_depth

How many levels of links to follow. NOTE: This value counts the number of links followed from the first argument passed to crawl().

delay

Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).

timeout

Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.
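
To make the purpose of use_md5 and md5_cache above more concrete, here is a minimal standalone sketch of content fingerprinting. It is not the module's implementation: the %fingerprint_seen hash is a hypothetical stand-in for the md5_cache object, is_new_content() is an illustrative helper, and the example.com pages are made up. Only the core Digest::MD5 module is used.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for md5_cache: fingerprint => first URI seen with it.
my %fingerprint_seen;

# Returns true the first time a given body is seen, false for duplicates.
sub is_new_content {
    my ( $uri, $content ) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if exists $fingerprint_seen{$fingerprint};
    $fingerprint_seen{$fingerprint} = $uri;
    return 1;
}

# Illustrative pages: two URIs serve identical content.
my %pages = (
    'http://example.com/'           => '<html>home</html>',
    'http://example.com/index.html' => '<html>home</html>',
    'http://example.com/about'      => '<html>about</html>',
);

# Keep only the first URI for each distinct body.
my @indexed = grep { is_new_content( $_, $pages{$_} ) } sort keys %pages;
print "$_\n" for @indexed;
```

In this sketch the second URI with the same body is skipped, which is the effect the use_md5 flag is described as providing.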

init

Initializes a new spider object. Called by new().

uri_ok( uri )

Returns true if uri is acceptable for inclusion in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
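
The three kinds of checks named above can be pictured with a small standalone sketch. Everything here is an assumption made for illustration: uri_is_acceptable() and its base, disallow, and skip keys are hypothetical, the robot rules are reduced to simple path prefixes, and the real uri_ok() may well decide differently.

```perl
use strict;
use warnings;

# Hypothetical acceptance check: same base, not disallowed, not skipped.
sub uri_is_acceptable {
    my ( $uri, %rule ) = @_;

    # Must live under the base URI the crawl started from.
    return 0 unless index( $uri, $rule{base} ) == 0;

    # Stand-in for robot rules: path prefixes that must not be fetched.
    my $path = substr( $uri, length $rule{base} );
    for my $prefix ( @{ $rule{disallow} || [] } ) {
        return 0 if index( $path, $prefix ) == 0;
    }

    # Stand-in for spider configuration: patterns to leave out of the index.
    for my $pattern ( @{ $rule{skip} || [] } ) {
        return 0 if $uri =~ $pattern;
    }
    return 1;
}

my %rule = (
    base     => 'http://example.com/',
    disallow => ['private/'],
    skip     => [qr/\.(?:jpg|png|gif)$/i],
);

for my $uri (
    'http://example.com/docs/intro.html',
    'http://example.com/private/notes.html',
    'http://other.example.org/',
    'http://example.com/logo.PNG',
  )
{
    printf "%-40s %s\n", $uri, uri_is_acceptable( $uri, %rule ) ? 'ok' : 'skip';
}
```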

get_doc

Returns the next URI from the queue() as a SWISH::Prog::Doc object, or the error message if there was one.

Returns undef if the queue is empty or max_depth() has been reached.

crawl( uri )

Implements the required crawl() method. Recursively fetches uri and its child links to a depth set in max_depth().
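
As a rough picture of how a depth limit, a queue of pending URIs, and a cache of seen URIs fit together, here is a minimal standalone sketch. It is an assumption-laden illustration, not the module's code: it walks a hypothetical in-memory %links graph instead of fetching pages, uses an iterative breadth-first loop rather than recursion, and crawl_sketch() is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical in-memory link graph standing in for real HTTP fetches.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => [ '/c', '/' ],
    '/b' => ['/c'],
    '/c' => ['/d'],
    '/d' => [],
);

# Visit pages breadth-first, following links at most $max_depth hops from $start.
sub crawl_sketch {
    my ( $start, $max_depth ) = @_;
    my @queue = ( [ $start, 0 ] );    # pending [uri, depth] pairs
    my %seen  = ( $start => 1 );      # URIs already queued or visited
    my @visited;
    while ( my $item = shift @queue ) {
        my ( $uri, $depth ) = @$item;
        push @visited, $uri;
        next if $depth >= $max_depth;    # do not follow links any deeper
        for my $child ( @{ $links{$uri} || [] } ) {
            next if $seen{$child}++;     # skip URIs that were already queued
            push @queue, [ $child, $depth + 1 ];
        }
    }
    return @visited;
}

my @visited = crawl_sketch( '/', 2 );
print join( ' ', @visited ), "\n";
```

With a depth of 2 the sketch visits '/', '/a', '/b' and '/c' but never reaches '/d', which is three links away from the starting point.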

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog




You can also look for information at:

* Mailing list

http://lists.swish-e.org/listinfo/users

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=SWISH-Prog

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/SWISH-Prog

* CPAN Ratings

http://cpanratings.perl.org/d/SWISH-Prog

* Search CPAN

http://search.cpan.org/dist/SWISH-Prog/

COPYRIGHT AND LICENSE

SEE ALSO

http://swish-e.org/

