WWW::Yahoo::Groups - Automated access to Yahoo! Groups archives.


WWW-Yahoo-Groups documentation Contained in the WWW-Yahoo-Groups distribution.

Index


Code Index:

NAME

Top

WWW::Yahoo::Groups - Automated access to Yahoo! Groups archives.

SYNOPSIS

Top

    my $y = WWW::Yahoo::Groups->new();
    $y->login( $user => $pass );
    $y->list( 'Jade_Pagoda' );
    my $email = $y->fetch_message( 2345 );

    # Error catching
    my $email = eval { $y->fetch_message( 93848 ) };
    if ( $@ and ref $@ and $@->isa('X::WWW::Yahoo::Groups') )
    {
        warn "Problem: ".$@->error;
    }

DESCRIPTION

Top

WWW::Yahoo::Groups retrieves messages from the archive of Yahoo Groups. It provides a simple OO interface to logging in and retrieving said messages which you may then do with as you will.

Things it does

USAGE

Top

Try to be a well behaved bot and sleep() for a few seconds (at least) after doing things. It's considered polite. There's an autosleep method that should be useful for this. Recently, this has been set to a default of 1 second. Feel free to tweak if necessary.

If you're used to seeing munged email addresses when you view the message archive (i.e. you're not a moderator or owner of the group) then you'll be pleased to know that WWW::Yahoo::Groups can demunge those email addresses.

All exceptions are subclasses of X::WWW::Yahoo::Groups, itself a subclass of Exception::Class. See WWW::Yahoo::Groups::Errors for details.

OTHER DOCUMENTATION

Top

Spidering Hacks, by Kevin Hemenway and Tara Calishain

Spidering Hacks from O'Reilly (http://www.oreilly.com/catalog/spiderhks/) is a great book for anyone wanting to know more about screen-scraping and spidering.

There is a WWW::Yahoo::Groups based hack by Andy Lester:

44 Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups

and two hacks, not related to this module, by me, Iain Truskett:

19 Scraping with HTML::TreeBuilder

METHODS

Top

Constructor

new

Create a new WWW::Yahoo::Groups robot.

    my $y = WWW::Yahoo::Groups->new();

It can take a has of named arguments. Two arguments are defined: debug and autosleep. They correspond to the methods of the same name.

    my $y = WWW::Yahoo::Groups->new(
        debug => 1,
        autosleep => 4,
    );

Options

debug

Enable/disable/read debugging mode.

    $y->debug(0); # Disable
    $y->debug(1); # Enable
    warn "Debugging!" if $y->debug();

The debug method of the current agent object will be invoked with the truth of the argument. This usually means debug in WWW::Yahoo::Groups::Mechanize.

autosleep

If given a parameter, it sets the numbers of seconds to sleep. Otherwise, it returns the number. Defaults to 1 second.

    $y->autosleep( 5 ); # Set it to 5.
    sleep ( $y->autosleep() );

May throw X::WWW::Yahoo::Groups::BadParam if given invalid parameters.

This is used by get. If autosleep is set, then get will sleep() for the specified period after every fetch.

Implemented by the object returned by agent. By default this means autosleep in WWW::Yahoo::Groups::Mechanize.

Logging in and out

login

Logs the robot into the Yahoo! Groups system.

    $y->login( $user => $passwd );

May throw:

logout

Logs the robot out of the Yahoo! Groups system.

    $y->logout();

May throw:

loggedin

Returns 1 if you are logged in, else 0. Note that this merely tests if you've used the login method successfully, not whether the Yahoo! site has expired your session.

   print "Logged in!\n" if $w->loggedin();

Setting target list and finding possible lists

list

If given a parameter, it sets the list to use. Otherwise, it returns the current list, or undef if no list is set.

IMPORTANT: list name must be correctly cased as per how Yahoo! Groups cases it. If not, you may experience odd behaviour.

    $y->list( 'Jade_Pagoda' );
    my $list = $y->list();

May throw X::WWW::Yahoo::Groups::BadParam if given invalid parameters.

See also lists for how to get a list of possible lists.

lists

If you'd like a list of the groups to which you are subscribed, then use this method.

    my @groups = $w->lists();

May throw X::WWW::Yahoo::Groups::BadParam if given invalid parameters, or X::WWW::Yahoo::Groups::BadFetch if it cannot fetch any of the appropriate pages from which it extracts the information.

Note that it does handle people with more than one page of groups.

List information

first_msg_id

Returns the lowest message number with the archive.

    my $first = $w->first_msg_id();

It will throw X::WWW::Yahoo::Groups::NoListSet if no list has been specified with lists, X::WWW::Yahoo::Groups::UnexpectedPage if the page fetched does not contain anything we thought it would, and X::WWW::Yahoo::Groups::BadFetch if it is unable to fetch the page it needs.

last_msg_id

Returns the highest message number with the archive.

    my $last = $w->last_msg_id();
    # Fetch last 10 messages:
    for my $number ( ($last-10) .. $last )
    {
        push @messages, $w->fetch_message( $number );
    }

It will throw X::WWW::Yahoo::Groups::NoListSet if no list has been specified with lists, X::WWW::Yahoo::Groups::UnexpectedPage if the page fetched does not contain anything we thought it would, and X::WWW::Yahoo::Groups::BadFetch if it is unable to fetch the page it needs.

Fetching an actual message

fetch_message

Fetches a specified message from the list's archives. Returns it as a mail message (with headers) suitable for saving into a Maildir.

    my $message = $y->fetch_message( 435 );

May throw any of:

reformat_headers

This does some simple reformatting of headers. Yahoo!Groups seems to manage to mangle multiline headers. This is particularly noticable with the Received header.

The rule is that any line that starts with a series of lowercase letters or hyphens that is NOT immediately followed by a colon is regarded as being part of the previous line and is indented with a space character (as per RFC2822).

Input to this method should be a whole message. Output is that same message, with the headers repaired.

This method is called by fetch_message but this was not always the case. If you have archives that predate this implicit call, you may want to run messages through this routine.

Other methods

fetch_rss

Returns the RSS for the group's most recent messages. See XML::Filter::YahooGroups for ways to process this RSS into containing the message bodies.

    my $rss = $w->fetch_rss();

If a parameter is given, it will return that many items in the RSS file. The number must be between 1 and 100 inclusive.

    my $rss = $w->fetch_rss( 10 );

PRIVATE METHODS

Top

agent

Returns or sets the WWW::Mechanize based agent. Not for general use. If you must fiddle with it, your object's API must match that of WWW::Yahoo::Groups::Mechanize and WWW::Mechanize.

get

Fetch a given URL. Delegated to "get" in WWW::Yahoo::Groups::Mechanize (well, the get method of the object returned by agent).

decode_protected

This method does nothing as Yahoo changed their algorithm.

_check_protected

This checks whether a given URL is to a protected email or not. It returns $text regardless as I do not have a decoding algorithm for Yahoo's updated encoding scheme.

    my $text = $self->_check_protected( $url, $text );

THANKS

Top

Simon Hanmer for having problems with the module, thus resulting in improved error reporting, param validation and corrected prerequisites. Since then, Simon also provided a basis for the lists and last_msg_id methods and is causing me to think harder about my exceptions.

Aaron Straup Cope (ASCOPE) for writing XML::Filter::YahooGroups which uses this module for retrieving message bodies to put into RSS.

Randal Schwartz (MERLYN) for pointing out some problems back in 1.4 and noting problems caused by the hash randomisation.

Ray Cielencki (SLINKY) for first_msg_id and "Age Restricted" notice bypassing.

Vadim Zeitlin for yahoo2mbox from which I blatantly stole some features. (Well, I say stole but yahoo2mbox is public domain).

Andy Lester (PETDANCE) for writing about this module in Spidering Hacks.

iTerrence Brannon (TBONE) for reporting the example program and empty body bugs.

BUGS

Top

Support for this module is provided courtesy the CPAN RT system via the web or email:

    http://perl.dellah.org/rt/yahoogroups
    bug-www-yahoo-groups@rt.cpan.org

This makes it much easier for me to track things and thus means your problem is less likely to be neglected.

Please include the versions of WWW::Yahoo::Groups and Perl that you are using and, if possible, the name of the group and the number of any messages you are having trouble with.

LICENCE AND COPYRIGHT

Top

AUTHOR

Top

Iain Truskett <spoon@cpan.org>

SEE ALSO

Top

perl, XML::Filter::YahooGroups, http://groups.yahoo.com/.

WWW::Mechanize, Exception::Class.

http://www.lpthe.jussieu.fr/~zeitlin/yahoo2mbox.html


WWW-Yahoo-Groups documentation Contained in the WWW-Yahoo-Groups distribution.
package WWW::Yahoo::Groups;
use strict;
use warnings FATAL => 'all';

our $VERSION = '1.91';

use Carp;
use HTTP::Cookies;
use HTML::Entities;
use Params::Validate qw( :all );
use WWW::Yahoo::Groups::Mechanize;

require WWW::Yahoo::Groups::Errors; 
Params::Validate::validation_options(
    WWW::Yahoo::Groups::Errors->import()
);

sub new
{
    my $class = shift;
    my %args = ( debug => 0, autosleep => 1, @_ );
    my $self = bless {}, $class;
    my $w = WWW::Yahoo::Groups::Mechanize->new();
    $self->agent($w);
    $self->debug( $args{debug} );
    $self->autosleep( $args{ autosleep } );
    return bless $self, $class;
}

sub debug
{
    my $self = shift;
    if (@_) {
	my $true = ($_[0] ? 1 : 0);
	$self->{__PACKAGE__.'-debug'} = $true;
	$self->agent->debug( $true );
    }
    $self->{__PACKAGE__.'-debug'};
}

sub autosleep { my $self = shift; $self->agent->autosleep(@_) }

sub login
{
    my $self = shift;
    my %p;
    @p{qw( user pass )} = validate_pos( @_,
	{ type => SCALAR, }, # user
	{ type => SCALAR, }, # pass
    );
    my $w = $self->agent();
    my $rv = eval {
	X::WWW::Yahoo::Groups::AlreadyLoggedIn->throw(
	    "You must logout before you can log in again.")
		if $self->loggedin;

	$w->get('http://groups.yahoo.com/');
	$w->follow('Sign In');
	$w->field( login => $p{user} );
	$w->field( passwd => $p{pass} );
	$w->click();
	if (my ($error) = $w->res->content =~ m!
	    	    \Q<font color=red face=arial><b>\E
	    	    \s+
	    	    (.*?)
	    	    \s+
	    	    \Q</b></font></td></tr></table>\E
	    	    !xsm)
	{
	    X::WWW::Yahoo::Groups::BadLogin->throw(
		fatal => 1,
		error => $error);
	}
	else
	{
	    while (my $url = $w->res->header('Location'))
	    {
		$self->get( $url );
	    }
	    if ( $w->content =~ m[
				\Qwindow.location.replace("http://groups.yahoo.com/");\E
				]x )
	    {
		$self->{__PACKAGE__.'-loggedin'} = 1;
	    } else {
		X::WWW::Yahoo::Groups::BadLogin->throw(
		    fatal => 1,
		    error => "Nope. That's not a good login.");
	    }
	}
	0;
    };
    if ($@) {
	die $@ unless ref $@;
	$@->rethrow if $@->fatal;
	$rv = $@;
    }
    return $rv;
}

sub logout
{
    my $self = shift;
    my $w = $self->agent;
    validate_pos( @_ );
    my $rv = eval {
	X::WWW::Yahoo::Groups::NotLoggedIn->throw(
	    "You can not log out if you are not logged in.")
		unless $self->loggedin;
	delete $self->{__PACKAGE__.'-loggedin'};

	$w->get('http://groups.yahoo.com/');

	X::WWW::Yahoo::Groups::NotLoggedIn->throw(
	    "You can not log out if you are not logged in.")
		unless $w->follow('Sign Out');

	$w->follow('Return to Yahoo! Groups');
	my $res = $w->res;
	while ($res->is_redirect)
	{
	    # We do this manually because it doesn't work automatically for
	    # some reason. I suspect we hit a redirection limit in LWP.
	    my $url = $res->header('Location');
	    $w->get($url);
	    $res = $w->res;
	}
	0;
    };
    if ($@) {
	die $@ unless ref $@;
	$@->rethrow if $@->fatal;
	$rv = $@;
    }
    return $rv;
}

sub loggedin
{
    my $self = shift;
    validate_pos( @_ );
    if (exists $self->{__PACKAGE__.'-loggedin'}
	    and $self->{__PACKAGE__.'-loggedin'})
    {
	return 1;
    }
    return 0;
}

sub list
{
    my $self = shift;
    if (@_) {
	my ($list) = validate_pos( @_,
	    { type => SCALAR, callbacks => {
		    'defined and of length' => sub {
			defined $_[0] and length $_[0]
		    },
		    'appropriate characters' => sub {
			defined $_[0] and $_[0] =~ /^ [\w-]+ $/x;
		    },
		}}, # list
	);
	delete @{$self}{qw( first last )};
	$self->{__PACKAGE__.'-list'} = $list;
    }
    return $self->{__PACKAGE__.'-list'};
}

sub lists
{
    my $self = shift;
    validate_pos( @_ );
    X::WWW::Yahoo::Groups::NotLoggedIn->throw(
	"Must be logged in to get a list of groups.")
	    unless $self->loggedin;

    my %lists;

    my $next = 'http://groups.yahoo.com/mygroups';
    my $group_RE = qr# /group/ ([\w-]+?) \Q?yguid=\E #x;
    my $w = $self->agent;
    do {
	$w->get( $next );
	undef $next;

	my @lists = map {
	    $_->url =~ $group_RE; $1;
	} $w->find_all_links(
	    url_regex => $group_RE,
	);
	@lists{@lists} = 1;

	if (my $url = $w->find_link( text => 'Next' ) )
	{
	    $next = $url->url;
	}
    } until ( not defined $next );

    return (sort keys %lists);
}

sub get_extent
{
    my $self = shift;
    validate_pos( @_ );
    my $list = $self->list();
    X::WWW::Yahoo::Groups::NoListSet->throw(
	"Cannot determine archive extent without a list being specified.")
	    unless defined $list and length $list;

    my $w = $self->agent;
    $w->get( "http://groups.yahoo.com/group/$list/messages/1" );
    my ($first, $last) = $w->res->content =~ m!
		<TITLE>
		[^<]+? : \s+
		(\d+)-\d+ \s+ (?:of|de|von|di|/) \s+
		(\d+)
		[^<]*?
		</TITLE>
        !six;

    X::WWW::Yahoo::Groups::UnexpectedPage->throw(
	"Unexpected title format. Perhaps group has no archive.")
	    unless defined $first;

    @{$self}{qw( first last )} = ( $first, $last );
    return ( $first, $last );
}

sub first_msg_id
{
    my $self = shift;
    validate_pos( @_ );
    $self->get_extent unless exists $self->{first};
    return $self->{first};
}

sub last_msg_id
{
    my $self = shift;
    validate_pos( @_ );
    $self->get_extent unless exists $self->{last};
    return $self->{last};
}

sub fetch_message
{
    my $self = shift;
    my ($number) = validate_pos( @_,
	{ type => SCALAR, callbacks => {
		'is positive integer' => sub { $_[0] =~ /^ (?!0+$) \d+ $/x },
	    } }, # number
    );
    my $list = $self->list();
    X::WWW::Yahoo::Groups::NoListSet->throw(
	"Cannot fetch a message without a list being specified.")
	unless defined $list and length $list;
    my $template = "http://groups.yahoo.com/group/$list/message/%d?source=1&unwrap=1";
    my $w = $self->agent;
    $w->get(sprintf $template, $number);
    my $res = $w->res;
    while ($res->is_redirect)
    {
	# We do this manually because it doesn't work automatically for
	# some reason. I suspect we hit a redirection limit in LWP.
	my $url = $res->header('Location');
	$w->get($url);
	$res = $w->res;
    }
    my $content = $res->content;
    if ( $w->uri =~ m,/interrupt\?st,gsm )
    {
	# If it's one of those damn interrupting ads, then click
	# through.
	$w->follow_link( url_regex => qr{ /\Q$list\E/message/\d+ }x );
	$res = $w->res;
	$content = $res->content;
    }

    # See if it's a missing article.
    if ($content =~ m!
		<br>
		\s+
		<blockquote>
		\s+
		\QMessage $number does not exist in $list\E
		</blockquote>
		!smx)
    {
	X::WWW::Yahoo::Groups::NotThere->throw(
	    "Message $number is not there.");
    }

    # Strip content boundaries
    $content =~ s/ ^ .*? \Q<!-- start content include -->\E //sx and
    $content =~ s/ \Q<!-- end content include -->\E .* $ //sx and

    # Strip table wrappings
    $content =~ s/ ^ .*? <table[^>]+> .*? <tt> //sx and
    $content =~ s! <br> \n <tt> !\n!xg and
    $content =~ s! <br> \n </td></tr> \n </table> .* $ !\n!sx and

    # Munge content
    $content =~ s{ <a \s+ href=" ([^"]+) "> ([^<]+) </a> }{
                $self->_check_protected($1,$2) }egx or
    X::WWW::Yahoo::Groups::UnexpectedPage->throw(
        "Message $number doesn't appear to be formatted as we like it.");

    for ($content)
    {
        s! </tt> !!xg;
        s/ ^ (--\w+--) <br> \n /$1\n\n/mgx;
        s/ <BR>\n /\n/igx;
        s/ <BR> //igx;
        s/(\n)\n+$/$1/;
        s{\Q<i>[\E(\QAttachment content not displayed.\E)\Q]</i>\E}{XXX $1 XXX\n}xg;
    }
    decode_entities($content);
    $content = $self->reformat_headers( $content );

    # Return
    return $content;
}

sub reformat_headers
{
    my ($self, $msg) = @_;

    my ($header, $body) = split /\n\n/, $msg, 2;

    $header =~ s/^ (?! (?:From\ |[a-z-]+:) ) / /igmx;
    $body = '' unless defined $body;

    return $header."\n\n".$body;
}

sub fetch_rss
{
    my $self = shift;
    my %opts;
    @opts{qw( count )} = validate_pos( @_,
	{ type => SCALAR, optional => 1, callbacks => {
		'is positive integer below 101' => sub {
                    $_[0] =~ /^ (?!0+$) \d+ $/x and $_[0] <= 100
                },
	    } }, # number
    );
    #             href="http://groups.yahoo.com/group/rss-dev/messages?rss=1&amp;viscount=30">
    my $list = $self->list();
    X::WWW::Yahoo::Groups::NoListSet->throw(
	"Cannot fetch a list's RSS without a list being specified.")
	    unless defined $list and length $list;
    my $url = "http://groups.yahoo.com/group/$list/messages?rss=1";
    $url .= "&viscount=$opts{count}" if $opts{count};
    my $w = $self->agent;
    $w->get( $url );
    my $content = $w->res->content;
    X::WWW::Yahoo::Groups::UnexpectedPage->throw(
	"Thought we were getting RSS. Got something else.")
            unless $content =~ m[^
                                \Q<?xml version="1.0" ?>\E \s*
                                \Q<rss version="2.0">\E    \s*
                                \Q<channel>\E
        ]sx;
    return $content;
}

sub agent
{
    my $self = shift;
    @_ ? ( $self->{agent} = $_[0], $self ) : $self->{agent};
}

sub get { my $self = shift; $self->agent->get(@_) }

sub decode_protected
{
    my ($self, $code) = @_;
    return;
}

sub _check_protected
{
    my ( $self, $href, $text ) = @_;
    return $text;
}

1;
__END__