LWP::UserAgent::ProxyHopper::Base - base class for LWP::UserAgent based modules which want to proxy-hop their requests


LWP-UserAgent-ProxyHopper-Base documentation Contained in the LWP-UserAgent-ProxyHopper-Base distribution.


NAME

LWP::UserAgent::ProxyHopper::Base - base class for LWP::UserAgent based modules which want to proxy-hop their requests

SYNOPSIS

    package LWP::UserAgent::Prox;

    use base 'LWP::UserAgent';
    use base 'LWP::UserAgent::ProxyHopper::Base';

    package main;

    use strict;
    use warnings;

    my $ua = LWP::UserAgent::Prox->new( agent => 'fox', timeout => 2);

    $ua->proxify_load( debug => 1 );

    for ( 1..10 ) {
        my $response = $ua->proxify_get('http://www.privax.us/ip-test/');

        if ( $response->is_success ) {
            my $content = $response->content;
            if ( my ( $ip ) = $content
                =~ m|<p>.+?IP Address:\s*</strong>\s*(.+?)\s+|s
            ) {
                printf "\n\nSuccess!!! \n%s\n", $ip;
            }
            else {
                printf "Response is successful but seems like we got a wrong "
                        . "page... here is what we got:\n%s\n", $content;
            }
        }
        else {
            printf "\n[SCRIPT] Network error: %s\n", $response->status_line;
        }
    }

DESCRIPTION

This module is a base class for LWP::UserAgent-based modules that want to proxy-hop their requests. In other words, each request can be sent out through a different proxy server. Originally, this module was meant to be released as LWP::UserAgent::ProxyHopper, but I figured it would be more useful as a base class.

WHAT'S IN IT?

By adding use base 'LWP::UserAgent::ProxyHopper::Base'; to your code, it should be possible to enable the extra functionality this base class provides without trouble. Your code should be a subclass of LWP::UserAgent, or at least properly support the proxy() method and one or more of LWP::UserAgent's request methods that return HTTP::Response objects.

HOW GOOD IS IT?

Don't get your hopes up too high... unless you can feed the module 100% working and fast proxies. Even though the module does some basic checks on whether the request succeeded and blacklists proxies that appear to be really bad, there is still quite a good chance that either (a) your request will time out after several tries or, worse, (b) your request will succeed but will not return what you expect, as some proxies tend to drop garbage on you. Depending on the settings your mileage will vary; it's a speed-for-quality trade-off.

HOW IT WORKS

The module fetches a list of proxy servers (see the proxify_load() method). When one of the proxify_*() request methods is called, it takes a proxy from the list and tries to make your request through it. If the request succeeds, the module checks the response for a couple of known "this is not what you wanted" proxy pages and retries the request with a different proxy if that is the case. If this check does not raise any suspicion, the result (an HTTP::Response object) is returned to you and the proxy that was used is put into a "working" list. If the request fails, the module does a basic check on the returned status code and decides whether to blacklist the proxy into the "bad" list or the "real_bad" list, after which it retries. The number of retries depends on the retries argument to the proxify_load() method.
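The blacklisting decision described above can be sketched in a few lines. This is only an illustration based on the checks visible in the module's source; classify_failed_proxy() is a hypothetical helper, not part of the module's API, and the proxy address is made up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module's API): decide which
# blacklist a proxy belongs to after a failed request.
sub classify_failed_proxy {
    my ( $code, $status_line, $proxy ) = @_;

    # A 500 whose status line mentions the proxy itself, or a
    # 400/502/504, is treated as a sign that the proxy is hopeless.
    return 'real_bad'
        if $status_line =~ /500.+\Q$proxy\E/
            or $code == 400
            or $code == 502
            or $code == 504;

    # Anything else might be the target site's fault, so the proxy
    # only lands in the "bad" list and may be tried again later.
    return 'bad';
}

my $proxy = 'http://10.0.0.1:8080/';    # made-up address

print classify_failed_proxy( 504, '504 Gateway Timeout', $proxy ), "\n";
print classify_failed_proxy( 404, '404 Not Found',       $proxy ), "\n";
print classify_failed_proxy(
    500, "500 Can't connect to $proxy", $proxy
), "\n";
```

A 404 only earns the proxy a place in the "bad" list because the error may well come from the target site rather than from the proxy.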

When the original proxy list is exhausted, the module makes a new list out of the proxies it previously listed as "working"; if that list is empty, it falls back to the "bad" list, which might still contain working proxies. The "real_bad" list is never reused. If both the "working" and "bad" lists have no proxies left, the module calls proxify_load() automatically with the same arguments you gave it the last time. Therefore, your program can live long with just one call to proxify_load() during startup.
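The fallback order can be illustrated with a small simulation. This is a sketch under stated assumptions, not the module's code: the %lists hash and next_proxy() are hypothetical stand-ins for the module's internal state, and the addresses are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's internal lists.
my %lists = (
    main     => [],
    working  => [ 'http://10.0.0.1:8080/' ],
    bad      => [ 'http://10.0.0.2:8080/' ],
    real_bad => [ 'http://10.0.0.3:8080/' ],
);

# Pick the next proxy: the main list first, then recycle "working",
# then "bad". The "real_bad" list is never recycled.
sub next_proxy {
    my ( $lists ) = @_;

    for my $fallback ( undef, 'working', 'bad' ) {
        if ( defined $fallback and not @{ $lists->{main} } ) {
            $lists->{main}      = $lists->{$fallback};
            $lists->{$fallback} = [];
        }
        my $proxy = shift @{ $lists->{main} };
        return $proxy if defined $proxy;
    }

    # The real module would call proxify_load() again at this point.
    return;
}

my @order;
while ( defined( my $proxy = next_proxy( \%lists ) ) ) {
    push @order, $proxy;
}
print "$_\n" for @order;
```

The simulation hands out the "working" proxy first, then the "bad" one, and leaves the "real_bad" entry untouched.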

PROVIDED METHODS

All public methods are prefixed with proxify_; all private methods are prefixed with _proxify_.

proxify_load

    $your_ua->proxify_load; # plain defaults

    $your_ua->proxify_load(  # juicy override
        freeproxylists  => 1,
        plan_b          => 1,
        proxy4free      => 0,
        timeout         => 20,
        debug           => 0,
        retries         => 5,
        extra_proxies   => [],
        schemes         => [ 'http', 'ftp' ],
        get_list_args   => {
            freeproxylists  => [ type => 'anonymous' ],
            proxy4free      => [ [2,3] ],
        },
    );

Instructs the object to load a list of proxies. You must call this method at least once before calling any of the proxify_* request methods. The return value is an arrayref of proxy addresses in the form "http://122.122.122.122:8080/". The method will croak() if the proxy list is still empty after trying to fetch proxy lists and after adding extra_proxies (see below). The method takes quite a few arguments, all of which are given in a key/value fashion. All of them are optional. Possible arguments are as follows:
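As a rough illustration of the address format, the snippet below builds such addresses from hashrefs with ip and port keys, which is the shape the module's source expects from the proxy-list modules. The entries themselves are made up.

```perl
use strict;
use warnings;

# Made-up entries in the shape the proxy-list modules are assumed
# to return: hashrefs with "ip" and "port" keys.
my @fetched = (
    { ip => '10.0.0.1', port => 8080 },
    { ip => '10.0.0.2', port => 3128 },
);

# Turn each entry into an address acceptable to proxy()
my @addresses = map { "http://$_->{ip}:$_->{port}/" } @fetched;

print "$_\n" for @addresses;
```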

freeproxylists

    $your_ua->proxify_load( freeproxylists => 1 );

Optional. The module uses the WWW::FreeProxyListsCom and WWW::Proxy4FreeCom modules to get the proxy list. If you set the freeproxylists argument to a false value, the module will not attempt to load any proxies from the http://freeproxylists.com/ website. Defaults to: 1

proxy4free

    $your_ua->proxify_load( proxy4free => 0 );

Optional. The module uses the WWW::FreeProxyListsCom and WWW::Proxy4FreeCom modules to get the proxy list. If you set the proxy4free argument to a false value (which is the default), the module will not attempt to load any proxies from the http://www.proxy4free.com/ website. Defaults to: 0

plan_b

    $your_ua->proxify_load( plan_b => 1 );

Optional. When set to a true value, enables a "Plan B" mechanism: when plan_b and freeproxylists are both set to true values and the fetch from http://freeproxylists.com/ did not give us any proxies, the module will fetch a list from the http://www.proxy4free.com/ website regardless of whether proxy4free is set to a true value. In other words, this is a fallback in case http://freeproxylists.com/ is down while proxy4free is set to a false value to speed up the proxy list loading process. Defaults to: 1 (enabled)
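The decision can be summarised as a small predicate. This is a sketch that mirrors the condition found in the module's source; wants_proxy4free() is a hypothetical helper name, and the second argument stands for the number of proxies obtained so far.

```perl
use strict;
use warnings;

# Hypothetical helper: should the proxy4free.com list be fetched?
# $have_proxies is the number of proxies obtained so far.
sub wants_proxy4free {
    my ( $args, $have_proxies ) = @_;

    return 1 if $args->{proxy4free};                       # asked for explicitly
    return 1 if $args->{plan_b} and not $have_proxies;     # Plan B fallback
    return 0;
}

for my $case (
    [ { proxy4free => 0, plan_b => 1 }, 25 ],
    [ { proxy4free => 0, plan_b => 1 }, 0  ],
    [ { proxy4free => 0, plan_b => 0 }, 0  ],
    [ { proxy4free => 1, plan_b => 0 }, 25 ],
) {
    my ( $args, $have ) = @$case;
    printf "proxy4free=%d plan_b=%d have=%-2d => %s\n",
        $args->{proxy4free}, $args->{plan_b}, $have,
        wants_proxy4free( $args, $have ) ? 'fetch' : 'skip';
}
```

Only the second and fourth cases lead to a fetch: Plan B kicks in when nothing was obtained, and an explicit proxy4free always fetches.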

timeout

    $your_ua->proxify_load( timeout => 20 );

Optional. Takes a positive integer value which will be passed to the WWW::FreeProxyListsCom and WWW::Proxy4FreeCom constructors as the timeout argument. In other words, this specifies the timeout for proxy list fetching. Defaults to: 20

retries

    $your_ua->proxify_load( retries => 5 );

Optional. Specifies how many times the module should retry proxify_* requests if they do not look successful. Generally, setting retries to a higher value will yield more reliable requests but will also slow down the request process. See the HOW IT WORKS section above to get an idea of when the module will retry the request. Defaults to: 5
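To get a feel for what the retries value means in practice, here is a small simulation of the retry loop. It is a sketch that borrows the "stop once tries exceeds retries" check from the module's source; attempts_needed() and the two code refs are hypothetical and no real requests are made.

```perl
use strict;
use warnings;

# Hypothetical simulation: $request is a code ref that receives the
# attempt number and returns true when the "request" succeeds.
sub attempts_needed {
    my ( $retries, $request ) = @_;

    my $tries = 0;
    while (1) {
        $tries++;
        return $tries if $request->($tries);
        last if $tries > $retries;    # same check the module's source uses
    }
    return $tries;                    # gave up
}

my $always_fail = sub { 0 };
my $third_time  = sub { $_[0] >= 3 };

printf "always failing, retries => 5: %d attempts\n",
    attempts_needed( 5, $always_fail );
printf "succeeds on 3rd try, retries => 5: %d attempts\n",
    attempts_needed( 5, $third_time );
```

With this check, retries => 5 allows one initial attempt plus five retries, i.e. at most six requests before the last response is handed back.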

extra_proxies

    $your_ua->proxify_load( extra_proxies => [] );

Optional. Takes an arrayref of proxy addresses in a format acceptable to LWP::UserAgent's proxy() method. These are extra proxies of your own for the module to use. You can set the freeproxylists and plan_b arguments to false values and put your own proxies into the extra_proxies arrayref, in which case the module will not even attempt to fetch any lists from proxy list sites (i.e. loading will be much faster). Defaults to: [] (no extra proxies)
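Judging by the module's source, extra proxies are placed in front of the fetched ones and duplicates are then removed. The sketch below reproduces that with core Perl only (a %seen hash instead of List::MoreUtils); all addresses are made up.

```perl
use strict;
use warnings;

# Made-up data: what the list sites returned and what the caller supplied
my @fetched = ( 'http://10.0.0.2:3128/', 'http://10.0.0.1:8080/' );
my @extra   = ( 'http://10.0.0.1:8080/', 'http://10.0.0.9:8080/' );

# Extra proxies go to the front, then duplicates are dropped while
# keeping the first occurrence of each address.
my %seen;
my @proxies = grep { !$seen{$_}++ } ( @extra, @fetched );

print "$_\n" for @proxies;
```

Because the extra proxies come first, they are also the first ones to be used for requests.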

schemes

    $your_ua->proxify_load( schemes => [ 'http', 'ftp' ] );

    $your_ua->proxify_load( schemes => 'ftp' );

Optional. Specifies the first argument to pass to LWP::UserAgent's proxy() method (i.e. the schemes to proxy for). Takes either a single scheme or an arrayref of schemes. Note: schemes other than 'http' have not been tested and might not work with the proxy lists the module fetches by default. Defaults to: http

get_list_args

    $your_ua->proxify_load(
        get_list_args   => {
            freeproxylists  => [ type => 'anonymous' ],
            proxy4free      => [ [1,2] ],
        },
    );

Optional. Here you can specify arguments for the get_list() methods of the WWW::FreeProxyListsCom and WWW::Proxy4FreeCom modules used under the hood. The get_list_args argument takes a hashref with two keys as a value. The keys must be freeproxylists and proxy4free, and their values must be arrayrefs of arguments to give to the get_list() methods of the respective modules.

debug

    $your_ua->proxify_load( debug => 0 );

Optional. When set to a true value, makes the module carp() some debugging info (including while processing any of the proxify_* request methods). Defaults to: 0

proxify_get

    my $response = $your_ua->proxify_get('http://something.com/');

Must be called after a successful call to the proxify_load() method. The method is the same as LWP::UserAgent's get() method, except proxify_get() will switch proxies before attempting the request.

proxify_post

    my $response = $your_ua->proxify_post('http://something.com/');

Must be called after a successful call to the proxify_load() method. The method is the same as LWP::UserAgent's post() method, except proxify_post() will switch proxies before attempting the request. Note: during my tests a lot of (almost all) proxies from http://www.freeproxylist.com/ did not permit POST requests. You might have better luck setting proxy4free to a true value, disabling the freeproxylists argument, and setting a higher retries argument (see the proxify_load() method above).

proxify_request

    my $response = $your_ua->proxify_request( $req_obj );

Must be called after a successful call to the proxify_load() method. The method is the same as LWP::UserAgent's request() method, except proxify_request() will switch proxies before attempting the request.

proxify_head

    my $response = $your_ua->proxify_head('http://something.com/');

Must be called after a successful call to the proxify_load() method. The method is the same as LWP::UserAgent's head() method, except proxify_head() will switch proxies before attempting the request.

proxify_mirror

    my $response = $your_ua->proxify_mirror(
        'http://something.com/file.tar.gz',
        'here.tar.gz',
    );

Must be called after a successful call to the proxify_load() method. The method is the same as LWP::UserAgent's mirror() method, except proxify_mirror() will switch proxies before attempting the request. Note: use this method with caution, as some proxies return an HTML document instead of the actual content you requested.

proxify_simple_request

    my $response = $your_ua->proxify_simple_request('http://something.com/');

Must be called after a successful call to the proxify_load() method. The method is the same as LWP::UserAgent's simple_request() method, except proxify_simple_request() will switch proxies before attempting the request.

proxify_list

    my $proxies_list_ref = $your_ua->proxify_list;

Must be called after a successful call to the proxify_load() method. Takes no arguments; returns an arrayref of proxies used internally for requests. This list will shrink as more requests are made (until it is depleted and reloaded; see the HOW IT WORKS section). Note: you can shift, push, etc. on this arrayref to dynamically set which proxies will be used. The proxy to be used on the next proxify_* request is the first element of this arrayref.
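A minimal sketch of such manipulation is shown below. The $list variable is only a stand-in for the arrayref returned by proxify_list, and all addresses are made up.

```perl
use strict;
use warnings;

# Stand-in for the arrayref returned by $your_ua->proxify_list;
# all addresses are made up.
my $list = [ 'http://10.0.0.1:8080/', 'http://10.0.0.2:8080/' ];

# Put a preferred proxy at the front so that it is picked next.
unshift @$list, 'http://10.0.0.9:8080/';

# Drop an unwanted proxy in place. Assigning through @$list keeps the
# same arrayref, so whoever else holds the reference sees the change.
@$list = grep { $_ ne 'http://10.0.0.2:8080/' } @$list;

print "next proxy: $list->[0]\n";
print "remaining:  ", scalar @$list, "\n";
```

Modifying the array in place matters here: assigning a brand new arrayref to your own variable would leave the list held by the object unchanged.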

proxify_working_list

    my $proxies_working_list_ref = $your_ua->proxify_working_list;

Must be called after a successful call to the proxify_load() method. Takes no arguments; returns an arrayref of proxies listed as "working". See the HOW IT WORKS section above for details. Note: you can shift, push, etc. on this arrayref to dynamically change it.

proxify_bad_list

    my $proxies_bad_list_ref = $your_ua->proxify_bad_list;

Must be called after a successful call to the proxify_load() method. Takes no arguments; returns an arrayref of proxies listed as "bad". See the HOW IT WORKS section above for details. Note: you can shift, push, etc. on this arrayref to dynamically change it.

proxify_real_bad_list

    my $proxies_real_bad_list_ref = $your_ua->proxify_real_bad_list;

Must be called after a successful call to the proxify_load() method. Takes no arguments; returns an arrayref of proxies listed as "real bad". See the HOW IT WORKS section above for details.

proxify_schemes

    my $used_schemes = $your_ua->proxify_schemes;

    $your_ua->proxify_schemes( [ 'http', 'ftp' ] );

Returns the currently used value of the proxify_load() method's schemes argument. If called with an optional argument, uses it as the new value. See the proxify_load() method above for details. Note: the value will be reset on the next proxify_load() call, which can happen automatically if the proxy lists are exhausted. See the HOW IT WORKS section for details.

proxify_retries

    my $used_retries = $your_ua->proxify_retries;

    $your_ua->proxify_retries( 10 );

Returns the currently used value of the proxify_load() method's retries argument. If called with an optional argument, uses it as the new value. See the proxify_load() method above for details. Note: the value will be reset on the next proxify_load() call, which can happen automatically if the proxy lists are exhausted. See the HOW IT WORKS section for details.

proxify_debug

    my $used_debug = $your_ua->proxify_debug;

    $your_ua->proxify_debug( 1 );

Returns the currently used value of the proxify_load() method's debug argument. If called with an optional argument, uses it as the new value. See the proxify_load() method above for details. Note: the value will be reset on the next proxify_load() call, which can happen automatically if the proxy lists are exhausted. See the HOW IT WORKS section for details.

proxify_current

    my $current_proxy = $your_ua->proxify_current;

Takes no arguments; returns the last proxy used in the proxify_* request methods. Why is it called "current"? Because it may change several times during a call to a proxify_* request method, depending on the retries argument's setting (in the proxify_load() method).

AUTHOR

Zoffix Znet, <zoffix at cpan.org> (http://zoffix.com/, http://haslayout.net/, http://zofdesign.com/)

BUGS

Please report any bugs or feature requests to bug-lwp-useragent-proxyhopper-base at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=LWP-UserAgent-ProxyHopper-Base. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc LWP::UserAgent::ProxyHopper::Base

You can also look for information at:

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=LWP-UserAgent-ProxyHopper-Base

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/LWP-UserAgent-ProxyHopper-Base

* CPAN Ratings

http://cpanratings.perl.org/d/LWP-UserAgent-ProxyHopper-Base

* Search CPAN

http://search.cpan.org/dist/LWP-UserAgent-ProxyHopper-Base

COPYRIGHT & LICENSE


package LWP::UserAgent::ProxyHopper::Base;

use warnings;
use strict;

our $VERSION = '0.002';

use Carp;
use Devel::TakeHashArgs;
use List::MoreUtils 'uniq';
use WWW::FreeProxyListsCom;
use WWW::Proxy4FreeCom;
use base 'Class::Data::Accessor';
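# qw() must be wrapped in parentheses when used as a method argument list;
# the bare form is a syntax error on newer perls
__PACKAGE__->mk_classaccessors( qw(
    proxify_list
    proxify_bad_list
    proxify_real_bad_list
    proxify_working_list
    proxify_schemes
    proxify_retries
    proxify_debug
    proxify_current
    _proxify_last_load_args
    _proxify_freeproxylists_obj
    _proxify_proxy4free_obj
) );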

sub proxify_load {
    my $self = shift;
    get_args_as_hash(\@_, \my %args, {
            freeproxylists  => 1,
            plan_b          => 1,
            proxy4free      => 0,
            timeout         => 20,
            debug           => 0,
            retries         => 5,
            extra_proxies   => [],
            schemes         => 'http',
            get_list_args   => {
                freeproxylists  => [ ],
                proxy4free      => [ ],
            },
        },
    ) or croak $@;

    $self->_proxify_last_load_args( \%args );

    my @proxies;

    if ( $args{freeproxylists} ) {
        my $obj = $self->_proxify_freeproxylists_obj(
            WWW::FreeProxyListsCom->new( timeout => $args{timeout} )
        );

        my $list_ref
        = $obj->get_list( @{$args{get_list_args}{freeproxylists}} );
        if ( defined $list_ref ) {
            push @proxies, map { "http://$_->{ip}:$_->{port}/" } @$list_ref;
        }
        else {
            $args{debug}
                and carp 'Failed while trying to get a proxy list from '
                            . 'http://freeproxylists.com: ' . $obj->error;
        }
    }

    if ( $args{proxy4free} or ( !@proxies and $args{plan_b} ) ) {
        my $obj = $self->_proxify_proxy4free_obj(
            WWW::Proxy4FreeCom->new( timeout => $args{timeout} )
        );

        my $list_ref = $obj->get_list( @{$args{get_list_args}{proxy4free}} );

        if ( defined $list_ref ) {
            push @proxies, map { "http://$_->{ip}:$_->{port}/" } @$list_ref;
        }
        else {
            $args{debug}
                and carp 'Failed while trying to get a proxy list from '
                            . 'http://proxy4free.com: ' . $obj->error;
        }
    }

    unshift @proxies, @{ $args{extra_proxies} };

    croak q|Don't have ANY proxy addresses :(|
        unless @proxies;

    @proxies = uniq @proxies;

    $args{debug}
        and carp "Got " . @proxies . " proxies in total";

    $self->proxify_retries( $args{retries} );
    $self->proxify_schemes( $args{schemes} );
    $self->proxify_debug(   $args{debug  } );
    $self->proxify_working_list( [] );
    $self->proxify_bad_list( [] );
    $self->proxify_real_bad_list( [] );

    return $self->proxify_list( \@proxies );
}

sub proxify_get { return shift->_proxify_try_request( 'get', \@_ ); }
sub proxify_post { return shift->_proxify_try_request( 'post', \@_ ); }
sub proxify_request { return shift->_proxify_try_request( 'request', \@_ ); }
sub proxify_head { return shift->_proxify_try_request( 'head', \@_ ); }
sub proxify_mirror { return shift->_proxify_try_request( 'mirror', \@_ ); }
sub proxify_simple_request {
    return shift->_proxify_try_request( 'simple_request', \@_ );
}

sub _proxify_try_request {
    my ( $self, $req_type, $args_ref ) = @_;

    my $current_proxy = $self->_proxify_set_proxy;
    my $tries;
    my $max_tries = $self->proxify_retries;
    TRY_REQ: {
        $tries++;

        my $response = $self->$req_type( @$args_ref );
        if ( $response->is_success ) {
            # a lot of proxies seem to be run by this company and it will
            # give us a 200 but display their page with timeout
            # all we need to do is redo the request (with the same proxy)
            if ( $response->content =~ /\Qcodeen.cs.princeton.edu">CoDeeN/ ) {
                redo TRY_REQ
                    unless $tries > $max_tries;

                return $response;
            }

            if ( $self->_proxify_check_success($response->content) ) {
                push @{ $self->proxify_working_list }, $current_proxy;
                return $response;
            }

            # a 200, but the content is a known garbage page: blacklist
            # the proxy and retry with a different one
            push @{ $self->proxify_real_bad_list }, $current_proxy;
            $current_proxy = $self->_proxify_set_proxy;

            redo TRY_REQ
                unless $tries > $max_tries;

            return $response;
        }
        else {
            $self->proxify_debug
                and carp "Failed on proxify_$req_type(): "
                    . $response->status_line;

            if ( $response->status_line =~ /500.+\Q$current_proxy/
                or $response->code == 400
                or $response->code == 504
                or $response->code == 502
            ) {
                # BAD PROXY!!! NO COOKIE!
                push @{ $self->proxify_real_bad_list }, $current_proxy;
            }
            else {
                push @{ $self->proxify_bad_list }, $current_proxy;
            }
            $current_proxy = $self->_proxify_set_proxy;

            redo TRY_REQ
                unless $tries > $max_tries;

            # if we got here $response is not successful but that might have
            # nothing to do with proxies at all
            return $response;
        }
    } # TRY_REQ:{}
    croak 'I should never get to this point. Please email this message '
            . 'to zoffix@cpan.org. Thank you very much';
}

sub _proxify_set_proxy {
    my $self = shift;

    my $proxy = $self->proxify_current( shift @{ $self->proxify_list } );

    unless ( defined $proxy ) {
        $self->proxify_debug
            and carp 'proxify_list() is exhausted, trying "working" list';
    
        $self->proxify_list( $self->proxify_working_list );
        $self->proxify_working_list([]);
        $proxy = $self->proxify_current( shift @{ $self->proxify_list } );
    }

    unless ( defined $proxy ) {
        $self->proxify_debug
           and carp 'proxify_working_list() is exhausted, trying "bad" list';

        $self->proxify_list( $self->proxify_bad_list );
        $self->proxify_bad_list([]);
        $proxy = $self->proxify_current( shift @{ $self->proxify_list } );
    }

    unless ( defined $proxy ) {
        $self->proxify_debug
           and carp 'lists are exhausted, trying to proxify_load now';

        $self->proxify_load( %{ $self->_proxify_last_load_args || {} });
        $proxy = $self->proxify_current( shift @{ $self->proxify_list } );

        defined $proxy
            or croak 'After trying so hard I still could not get any more'
                . ' proxies to play with :(';
    }

    $self->proxify_debug
        and carp "Using proxy $proxy";
    
    $self->proxy($self->proxify_schemes, $proxy );

    return $proxy;
}

sub _proxify_check_success {
    my ( $self, $content ) = @_;
    return 1 if length $content > 4000;
    if ( $content =~ m|\s*
\Qhttp/1.1 401 Unauthorized\E\s*
\QServer:\E\s*
.+?
\QWWW-Authenticate: Basic realm="ADSL Router \(ANNEX A\)"\E\s*
\QContent-Type: text/html\E\s*
\QConnection: close\E\s*
\s*
\Q<html>\E\s*
\Q<head>\E\s*
\Q<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-9">\E\s*
\Q<META http-equiv="Pragma" CONTENT="no-cache">\E\s*
\Q<META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">\E\s*
\Q<meta HTTP-EQUIV="Expires" CONTENT="Mon, 06 Jan 1990 00:00:01 GMT">\E\s*
|xsm
    ) {
        return 0; # failed 
    }

    if ( $content =~ m|<title>ESPOCH Acceso denegado</title>| ) {
        return 0; # failed
    }
    return 1; # success
}

1;
__END__