WWW::ProxyChecker - check whether or not proxy servers are alive


WWW-ProxyChecker documentation Contained in the WWW-ProxyChecker distribution.

NAME

WWW::ProxyChecker - check whether or not proxy servers are alive

SYNOPSIS

    use strict;
    use warnings;
    use WWW::ProxyChecker;

    my $checker = WWW::ProxyChecker->new( debug => 1 );

    my $working_ref= $checker->check( [ qw(
                http://221.139.50.83:80
                http://111.111.12.83:8080
                http://111.111.12.183:3218
                http://111.111.12.93:8080
            )
        ]
    );

    die "No working proxies were found\n"
        if not @$working_ref;

    print "$_ is alive\n"
        for @$working_ref;

DESCRIPTION

The module provides a means to check whether or not HTTP proxies are alive. It was designed more for "quickly scanning through to get a few" than for "guaranteed or your money back"; there is therefore no guarantee that proxies reported as non-working are actually dead, or that all proxies reported as working are actually good.
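To make the "no guarantee" caveat concrete, here is a minimal sketch of how a failed request appears to be classified, paraphrased from the failure branch of `_check_proxy()` in the module source. The function name `proxy_alive_after_failure` and its argument order are assumptions made for illustration; they are not part of the module's API.

```perl
use strict;
use warnings;

# Hypothetical helper paraphrasing the failure branch of _check_proxy().
# Returns 1 if the proxy would still be counted as alive, 0 otherwise.
sub proxy_alive_after_failure {
    my ( $code, $status_line, $proxy ) = @_;

    # Status codes that mark the proxy itself as dead or unusable.
    return 0 if grep { $code eq $_ } qw(407 502 503 403);

    # Strip the scheme so the bare host:port can be matched in the message.
    ( my $proxy_no_scheme = $proxy ) =~ s{(?:ht|f)tps?://}{}i;

    # A read timeout, or an error message that names the proxy, means dead.
    # Any other failure is blamed on the target site, not on the proxy.
    return $status_line =~ /^500 read timeout$|\Q$proxy_no_scheme/ ? 0 : 1;
}

my $proxy = 'http://10.0.0.1:8080';
for my $case (
    [ 407, '407 Proxy Authentication Required' ],
    [ 500, '500 read timeout' ],
    [ 404, '404 Not Found' ],
) {
    my ( $code, $status_line ) = @$case;
    printf "%-35s => %s\n", $status_line,
        proxy_alive_after_failure( $code, $status_line, $proxy )
            ? 'alive' : 'dead';
}
```

Note how a 404 from the check site still counts the proxy as alive: the proxy relayed the request, so the failure is attributed to the site. This is one way a "working" proxy can turn out not to be good.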

CONSTRUCTOR

new

    my $checker = WWW::ProxyChecker->new;

    my $checker_juicy = WWW::ProxyChecker->new(
        timeout       => 5,
        max_kids      => 20,
        max_working_per_kid => 2,
        check_sites   => [ qw(
                http://google.com
                http://microsoft.com
                http://yahoo.com
                http://digg.com
                http://facebook.com
                http://myspace.com
            )
        ],
        agent   => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.12)'
                    .' Gecko/20080207 Ubuntu/7.10 (gutsy) Firefox/2.0.0.12',
        debug => 1,
    );

Bakes up and returns a new WWW::ProxyChecker object. Takes a few arguments, all of which are optional. Possible arguments are as follows:

timeout

    ->new( timeout => 5 );

Optional. Specifies the timeout, in seconds, to give to the LWP::UserAgent object used for checking. If a connection to the proxy times out, the proxy is considered dead. The lower the value, the faster the check completes, but the greater the chance that you will throw away good proxies. Defaults to: 5 seconds

agent

    ->new( agent => 'ProxeyCheckerz' );

Optional. Specifies the User Agent string to use while checking proxies. By default it is set to mimic Firefox.

check_sites

    ->new( check_sites => [ qw( http://some_site.com http://other.com ) ] );

Optional. Takes an arrayref of sites to try to connect to through a proxy. For each proxy, one site is picked from this list at random. Yes! It's evil; saner ideas are more than welcome. Defaults to:

    check_sites   => [ qw(
                http://google.com
                http://microsoft.com
                http://yahoo.com
                http://digg.com
                http://facebook.com
                http://myspace.com
            )
        ],

max_kids

    ->new( max_kids => 20 );

Optional. Takes a positive integer as a value. The module will fork at most max_kids processes to check proxies simultaneously. It will fork fewer if the total number of proxies to check is less than max_kids. Technically, setting this to a higher value might speed up the overall process, but keep in mind that it is also the number of simultaneous connections you will have open. Defaults to: 20
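As a rough illustration of how the proxy list ends up divided among the kids, the sketch below mirrors the splitting logic in the module's `_start_checker()`: equal-sized sub lists, with any leftovers appended to the last one. The function name `partition_proxies` and the empty-list guard are assumptions added for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split @proxies into at most $max_kids sub lists.
# Returns a hashref mapping kid number (1-based) to an arrayref of proxies.
sub partition_proxies {
    my ( $max_kids, @proxies ) = @_;
    return {} unless @proxies;    # nothing to split

    # Never use more kids than there are proxies.
    my $n = $max_kids > @proxies ? scalar @proxies : $max_kids;

    my $per_kid = int( @proxies / $n );
    my %chunk_for;
    $chunk_for{$_} = [ splice @proxies, 0, $per_kid ] for 1 .. $n;

    # Whatever did not divide evenly goes to the last kid.
    push @{ $chunk_for{$n} }, @proxies;

    return \%chunk_for;
}

my $chunks = partition_proxies( 3, 'a' .. 'g' );
print "kid $_: @{ $chunks->{$_} }\n" for sort { $a <=> $b } keys %$chunks;
```

With seven addresses and three kids, the first two kids get two addresses each and the last one gets three, so the last kid can take noticeably longer when the list does not divide evenly.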

max_working_per_kid

    ->new( max_working_per_kid => 2 );

Optional. Takes a positive integer as a value. Specifies how many working proxies each subprocess should find before stopping (it will also stop if its proxy list is exhausted). In other words, setting max_kids to 20 and max_working_per_kid to 2 will give you 40 working proxies at most, no matter how many are in the original list. Specifying undef removes the limit and makes each kid go over the entire sub list it was given. Defaults to: undef (go over entire sub list)
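The arithmetic above can be written down as a tiny helper. This is only a sketch of the upper bound implied by the description; `max_reported` is a hypothetical name and nothing like it exists in the module.

```perl
use strict;
use warnings;

# Hypothetical helper: the most proxies a single check() call could report,
# given the number of kids, the per-kid limit (undef means no limit) and the
# number of proxies in the input list.
sub max_reported {
    my ( $max_kids, $per_kid, $total ) = @_;

    # Without a per-kid limit every proxy in the list may be reported.
    return $total unless defined $per_kid;

    # No more kids are needed than there are proxies.
    my $kids = $max_kids < $total ? $max_kids : $total;
    my $cap  = $kids * $per_kid;

    # The result can never exceed the size of the input list.
    return $cap < $total ? $cap : $total;
}

print max_reported( 20, 2,     1000 ), "\n";    # 40
print max_reported( 20, undef, 1000 ), "\n";    # 1000
print max_reported( 20, 2,     10 ),   "\n";    # 10
```

The real count is usually lower, since some kids will exhaust their sub list before finding that many working proxies.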

debug

    ->new( debug => 1 );

Optional. When set to a true value, makes the module print out some debugging information (which proxies failed and how, etc.). By default not specified (debug is off)

METHODS

check

    my $working_ref = $checker->check( [ qw(
                http://221.139.50.83:80
                http://111.111.12.83:8080
                http://111.111.12.183:3218
                http://111.111.12.93:8080
            )
        ]
    );

Instructs the object to check several proxies. Returns a (possibly empty) array ref of addresses which the object considers to be alive and working. Takes an arrayref of proxy addresses. The elements of this arrayref will be passed to LWP::UserAgent's proxy() method as:

    $ua->proxy( [ 'http', 'https', 'ftp', 'ftps' ], $proxy );

so you can read the docs for LWP::UserAgent and maybe think up something creative.
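Judging by the module source, check() forks one child per sub list and each child reports its working proxies back to the parent over an IO::Pipe. The sketch below shows that pattern in isolation, under some loud assumptions: `looks_alive` is a made-up stand-in that does no networking at all (it just calls even port numbers "alive"), `check_in_children` is a hypothetical name, and the child leaves via `POSIX::_exit` rather than a plain `exit` so that it skips any END blocks inherited from the parent.

```perl
use strict;
use warnings;
use IO::Pipe;
use POSIX ();

# Hypothetical stand-in for a real proxy test: "alive" if the port is even.
sub looks_alive {
    my ($proxy) = @_;
    return $proxy =~ /:(\d+)$/ && $1 % 2 == 0;
}

# Takes a list of arrayrefs (one sub list per child) and returns an arrayref
# of the addresses the children reported as alive.
sub check_in_children {
    my (@chunks) = @_;

    my ( @readers, @pids );
    for my $chunk (@chunks) {
        my $pipe = IO::Pipe->new;
        my $pid  = fork;
        die "fork failed: $!" unless defined $pid;

        if ($pid) {    # parent: keep the read end of this child's pipe
            $pipe->reader;
            push @readers, $pipe;
            push @pids,    $pid;
        }
        else {         # child: write one address per line, then leave
            $pipe->writer;
            print {$pipe} "$_\n" for grep { looks_alive($_) } @$chunk;
            $pipe->close;
            POSIX::_exit(0);
        }
    }

    # Read each child's output until it closes its end of the pipe.
    my @alive;
    for my $fh (@readers) {
        while ( my $line = <$fh> ) {
            chomp $line;
            push @alive, $line;
        }
    }
    waitpid $_, 0 for @pids;    # reap the children

    return \@alive;
}

my $alive_ref = check_in_children(
    [ qw( http://10.0.0.1:80 http://10.0.0.2:81 ) ],
    [ qw( http://10.0.0.3:8080 ) ],
);
print "$_ is alive\n" for @$alive_ref;
```

Because the parent drains the pipes in the order the children were started, results come back grouped by sub list rather than in the order the checks actually finished.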

alive

    my $last_alive = $checker->alive;

Must be called after a call to check(). Takes no arguments; returns the same arrayref that the last call to check() returned.

ACCESSORS/MUTATORS

The module provides an accessor/mutator for each of the arguments in the constructor (new() method). Calling any of these with an argument sets a new value. All of them return the currently set value:

    max_kids
    check_sites
    max_working_per_kid
    timeout
    agent
    debug

See CONSTRUCTOR section for more information about these.
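For readers unfamiliar with the combined accessor/mutator style, here is a minimal, self-contained sketch of the calling convention. `My::Settings` is a hypothetical class written in plain Perl for illustration only; the module itself generates its accessors with Class::Data::Accessor, which this sketch does not try to reproduce.

```perl
package My::Settings;    # hypothetical class, not part of WWW::ProxyChecker
use strict;
use warnings;

# Generate one combined getter/setter per field.
for my $field (qw( timeout max_kids debug )) {
    no strict 'refs';
    *{ __PACKAGE__ . "::$field" } = sub {
        my $self = shift;
        $self->{$field} = shift if @_;    # called with an argument: set it
        return $self->{$field};           # always return the current value
    };
}

sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

package main;

my $settings = My::Settings->new( timeout => 5 );
print "timeout is ", $settings->timeout, "\n";    # getter
$settings->timeout(10);                           # setter
print "timeout is now ", $settings->timeout, "\n";
```

With the real module the calls have the same shape, e.g. `$checker->timeout(10)` to set and `$checker->timeout` to read.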

AUTHOR

Zoffix Znet, <zoffix at cpan.org> (http://zoffix.com, http://haslayout.net)

BUGS

Please report any bugs or feature requests to bug-www-proxychecker at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-ProxyChecker. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc WWW::ProxyChecker

You can also look for information at:

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=WWW-ProxyChecker

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/WWW-ProxyChecker

* CPAN Ratings

http://cpanratings.perl.org/d/WWW-ProxyChecker

* Search CPAN

http://search.cpan.org/dist/WWW-ProxyChecker

COPYRIGHT & LICENSE


package WWW::ProxyChecker;

use warnings;
use strict;

our $VERSION = '0.002';
use Carp;
use LWP::UserAgent;
use IO::Pipe;
use base 'Class::Data::Accessor';
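# qw() must be wrapped in parentheses when used as a method argument list;
# the bare form is a syntax error on modern perls
__PACKAGE__->mk_classaccessors( qw(
    max_kids
    debug
    alive
    check_sites
    max_working_per_kid
    timeout
    agent
) );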

sub new {
    my $self = bless {}, shift;
    croak "Must have even number of arguments to new()"
        if @_ & 1;

    my %args = @_;
    $args{ +lc } = delete $args{ $_ } for keys %args;

    %args = (
        timeout       => 5,
        max_kids      => 20,
        check_sites   => [ qw(
                http://google.com
                http://microsoft.com
                http://yahoo.com
                http://digg.com
                http://facebook.com
                http://myspace.com
            )
        ],
        agent   => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.12)'
                    .' Gecko/20080207 Ubuntu/7.10 (gutsy) Firefox/2.0.0.12',

        %args,
    );

    $self->$_( $args{ $_ } ) for keys %args;

    return $self;
}

sub check {
    my ( $self, $proxies_ref ) = @_;

    $self->alive(undef);

    print "About to check " . @$proxies_ref . " proxies\n"
        if $self->debug;

    my $working_ref = $self->_start_checker( @$proxies_ref );

    print @$working_ref . ' out of ' . @$proxies_ref
            . " seem to be alive\n" if $self->debug;

    return $self->alive( $working_ref);
}

sub _start_checker {
    my ( $self, @proxies ) = @_;

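    return [] unless @proxies; # nothing to check; also avoids dividing by zero

    my $n = $self->max_kids;
    $n > @proxies and $n = @proxies;
    my $mod = int( @proxies / $n ); # whole number of proxies per kid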
    my %prox;
    for ( 1 .. $n ) {
        $prox{ $_ } = [ splice @proxies, 0,$mod ]
    }
    push @{ $prox{ $n } }, @proxies; # append any left over addresses

    $SIG{CHLD} = 'IGNORE';
    my @children;
    for my $num ( 1 .. $n ) { # fork only as many kids as there are sub lists
        my $pipe = IO::Pipe->new;

        if ( my $pid = fork ) { # parent
            $pipe->reader;
            push @children, $pipe;
        }
        elsif ( defined $pid ) { # kid
            $pipe->writer;

            my $ua = LWP::UserAgent->new(
                timeout => $self->timeout,
                agent   => $self->agent,
            );

            my $check_sites_ref = $self->check_sites;
            my $debug = $self->debug;
            my @working;
            for my $proxy ( @{ $prox{ $num } } ) {
                print "Checking $proxy in kid $num\n"
                    if $debug;

                if ( $self->_check_proxy($ua, $proxy, $check_sites_ref) ) {
                    push @working, $proxy;

                    last
                        if defined $self->max_working_per_kid
                            and @working >= $self->max_working_per_kid;
                }
            }
            print $pipe "$_\n" for @working;
            exit;
        }
        else { # error
            carp "Failed to fork kid number $num ($!)"; # $! holds fork's error
        }

    }

    my @working_proxies;
    for my $num ( 0 .. $#children ) {
        my $fh = $children[$num];
        while (<$fh>) {
            chomp;
            push @working_proxies, $_;
        }
    }

    return \@working_proxies;
}

sub _check_proxy {
    my ( $self, $ua, $proxy, $sites_ref ) = @_;

    $ua->proxy( [ 'http', 'https', 'ftp', 'ftps' ], $proxy);
    my $response = $ua->get( $sites_ref->[rand @$sites_ref] );
    if ( $response->is_success ) {
        return 1;
    }
    else {
        printf "Failed on %s (%s)\n", $proxy, $response->status_line
            if $self->debug;

        my $response_code = $response->code;
        return 0
            if grep { $response_code eq $_ } qw(407 502 503 403);

        ( my $proxy_no_scheme = $proxy ) =~ s{(?:ht|f)tps?://}{}i;
        return $response->status_line
        =~ /^500 read timeout$|\Q$proxy_no_scheme/ ? 0 : 1;
    }
}

1;
__END__