WWW::CheckSite::Spider - A base class for spidering the web


Contained in the WWW-CheckSite distribution.

NAME

WWW::CheckSite::Spider - A base class for spidering the web

SYNOPSIS

    use WWW::CheckSite::Spider;

    my $sp = WWW::CheckSite::Spider->new(
         uri      => 'http://www.test-smoke.org',
    );

    while ( my $page = $sp->get_page ) {
        # $page is a hashref with basic information
    }

or to spider a site behind HTTP basic authentication:

    package BA_Mech;
    use base 'WWW::Mechanize';

    sub get_basic_credentials { ( 'abeltje', '********' ) }

    package main;
    use WWW::CheckSite::Spider;

    my $sp = WWW::CheckSite::Spider->new(
         ua_class => 'BA_Mech',
         uri      => 'http://your.site.with.ba/',
    );

    while ( my $page = $sp->get_page ) {
        # $page is a hashref with basic information
    }




DESCRIPTION

This module implements a basic web spider, based on WWW::Mechanize. It takes care of putting pages on the "still-to-fetch" stack. Only URIs with the same origin are stacked, and the robots rules on the server are taken into account.

CONSTANTS & EXPORTS

The following constants are exported on demand with the :const tag.

WCS_UNKNOWN
WCS_FOLLOWED
WCS_SPIDERED
WCS_TOSPIDER
WCS_TOFOLLOW
WCS_NOCONTENT
WCS_OUTSCOPE

METHODS

WWW::CheckSite::Spider->new( %opts )

Currently supported options (the rest will be set but not used!):

* uri => <start_uri> || <\@start_uri> [mandatory]
* ua_class => by default WWW::Mechanize
* exclude => <exclude_re> (default: qr/[#?].*$/)
* myrules => <\@disallow>
* lang => languages to pass in the Accept-Language: header

$spider->get_page

Fetch the next page and do some bookkeeping. It returns the result of $spider->process_page().

$spider->process_page( $uri )

Override this method to make the spider do something useful. By default it returns a hashref with these keys:

* org_uri The uri used for the request
* ret_uri The uri returned by the server
* depth The depth in the browse tree
* status The return status from the server
* success Shortcut for status == 200
* is_html Shortcut for ct eq 'text/html'
* title The contents of the <TITLE></TITLE> element
* ct The content-type
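
As a minimal sketch of consuming the hashref described above, the helper below formats one summary line per page. The name summarize_page and the sample data are hypothetical; only the key names come from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a page hashref (keys as listed above)
# into a one-line summary.
sub summarize_page {
    my ($page) = @_;

    # Only trust the title for HTML pages that actually have one.
    my $title = $page->{is_html} && defined $page->{title} && length $page->{title}
        ? $page->{title}
        : '(no title)';

    return sprintf '%s [%s] depth=%d %s',
        $page->{status}, $page->{ct}, $page->{depth}, $title;
}

# Made-up sample data, shaped like the hashref described above.
my $page = {
    org_uri => 'http://www.example.com/',
    ret_uri => 'http://www.example.com/index.html',
    depth   => 0,
    status  => 200,
    success => 1,
    is_html => 1,
    title   => 'Example',
    ct      => 'text/html',
};

print summarize_page($page), "\n";
```

Inside the while ( my $page = $sp->get_page ) loop from the SYNOPSIS, the same helper could be called on each $page.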

$spider->strip_uri( $uri )

Strip the fragment bit of the $uri.
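
The effect of stripping a fragment can be illustrated in plain Perl. This is a sketch only, not the module's implementation; strip_fragment is a hypothetical name and the URIs are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the module's code: drop everything from
# the first '#' to the end of the string.
sub strip_fragment {
    my ($uri) = @_;
    (my $stripped = $uri) =~ s/#.*\z//s;
    return $stripped;
}

for my $uri (
    'http://www.example.com/page.html#top',
    'http://www.example.com/search?q=perl#results',
    'http://www.example.com/plain.html',
) {
    print strip_fragment($uri), "\n";
}
```

In this sketch the query string is left alone; only the part after '#' is removed.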

USERAGENT METHODS

$spider->agent

Returns a standard name for this UserAgent.

$spider->init_agent

Initialise the agent that is used to fetch pages. The default class is WWW::Mechanize but any class that has the same methods will do.

The ua_class needs to support the following methods (see WWW::Mechanize for more information about these):

new
get
base
uri
status
success
ct
is_html
title
HEAD (for WWW::CheckSite::Validator)
content (for WWW::CheckSite::Validator)
images (for WWW::CheckSite::Validator)
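
One quick way to see whether a candidate class offers the methods listed above is the universal can method. The sketch below is an illustration only: missing_methods and My::FakeUA are hypothetical names, and the Spider itself may not perform any such check.

```perl
use strict;
use warnings;

# Methods the ua_class must provide according to the list above
# (the Validator-only ones are left out of this sketch).
my @required = qw( new get base uri status success ct is_html title );

# Hypothetical helper: return the required methods a class lacks.
sub missing_methods {
    my ($class) = @_;
    return grep { !$class->can($_) } @required;
}

# Hypothetical, deliberately incomplete user agent class.
package My::FakeUA;
sub new { return bless {}, shift }
sub get { return }

package main;

my @missing = missing_methods('My::FakeUA');
print "My::FakeUA lacks: @missing\n";
```

An empty list from missing_methods would suggest the class is at least worth trying as ua_class; it says nothing about whether the methods behave like their WWW::Mechanize counterparts.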

$spider->current_agent

Return the current user agent.

$spider->new_agent

Create a new agent and return it.

ROBOTRULES METHODS

The Spider uses the robot rules mechanism. This means that it will always get the /robots.txt file from the root of the web server to see whether it is allowed (actually "not disallowed") to access pages as a robot.

You can add rules for disallowing pages by adding lines in robots.txt syntax to @{ $self->{myrules} }.

$spider->more_rrules( $url )

Check to see if the robots.txt file for this $url has already been loaded. If not, fetch the file and add the rules to the $self->{_r_rules} object.

$spider->uri_ok( $uri )

This will determine whether this uri should be spidered. Rules are simple:

* Has the same base uri as the one we started with
* Is not excluded by the $self->{exclude} regex.
* Is not excluded by robots.txt mechanism
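
The three rules can be sketched in plain Perl as follows. This is an illustration under stated assumptions, not the module's code: uri_ok_sketch is a hypothetical name, "same base" is approximated by a string-prefix test, and the robots check is passed in as a callback.

```perl
use strict;
use warnings;

# Hypothetical sketch of the three checks; not the module's implementation.
sub uri_ok_sketch {
    my ($uri, %arg) = @_;

    # 1. Same base uri as the start uri (approximated by a prefix match).
    return 0 unless index($uri, $arg{base}) == 0;

    # 2. Not matched by the exclude regex.
    return 0 if $uri =~ $arg{exclude};

    # 3. Not disallowed by the robots rules (callback is true if allowed).
    return $arg{allowed}->($uri) ? 1 : 0;
}

my %check = (
    base    => 'http://www.example.com/',
    exclude => qr/\.pdf$/,                       # made-up pattern
    allowed => sub { $_[0] !~ m{/private/} },    # stand-in for robots.txt
);

for my $uri (
    'http://www.example.com/index.html',
    'http://www.example.com/manual.pdf',
    'http://www.example.com/private/x.html',
    'http://other.example.org/',
) {
    printf "%-40s %s\n", $uri, uri_ok_sketch($uri, %check) ? 'ok' : 'skip';
}
```

The checks run cheapest first, so a uri outside the start site is rejected before any robots rules are consulted.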

$spider->allowed( $uri )

Checks the uri against the robotrules.

$spider->init_robotrules( )

This will set up a WWW::RobotRules object. @{ $self->{myrules} } is used to add rules and should be in the RobotRules format. These rules are added to the ones found in robots.txt.
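
To show what rules in robots.txt syntax look like to a WWW::RobotRules object, here is a small standalone sketch. It uses WWW::RobotRules directly and does not show how the Spider feeds myrules into its own object; the agent name, host and rules are made up for the example.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# The agent name and the rules below are made up for this example.
my $rules = WWW::RobotRules->new('ExampleSpider/0.1');

my $robots_txt = join "\n",
    'User-agent: *',
    'Disallow: /cgi-bin/',
    'Disallow: /private/',
    '';

# parse() takes the URL the robots.txt belongs to and its content;
# no network access happens here.
$rules->parse('http://www.example.com/robots.txt', $robots_txt);

for my $uri (
    'http://www.example.com/index.html',
    'http://www.example.com/private/notes.html',
) {
    printf "%-45s %s\n", $uri,
        $rules->allowed($uri) ? 'allowed' : 'disallowed';
}
```

Rules are kept per host, so a uri on a host for which nothing has been parsed is not covered by these Disallow lines.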

$spider->current_rrules

Returns the current RobotRules object.

AUTHOR

Abe Timmerman, <abeltje@cpan.org>

BUGS

Please report any bugs or feature requests to bug-www-checksite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

COPYRIGHT & LICENSE
