WWW::Scraper::ISBN::AmazonFR_Driver - Search driver for the (FR) Amazon online catalog.


WWW-Scraper-ISBN-AmazonFR_Driver documentation Contained in the WWW-Scraper-ISBN-AmazonFR_Driver distribution.


NAME


WWW::Scraper::ISBN::AmazonFR_Driver - Search driver for the (FR) Amazon online catalog.

SYNOPSIS


See parent class documentation (WWW::Scraper::ISBN::Driver)

DESCRIPTION


Searches for book information in the (FR) Amazon online catalog. This module is a copy and translation of WWW::Scraper::ISBN::AmazonUS_Driver. The main (only?) difference is in how the result page is parsed: here it is done with simple regular expressions, whereas AmazonUS_Driver uses Template::Extract.
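As a rough illustration of the regexp approach, the sketch below pulls the page description out of an HTML header with a single pattern. The HTML fragment and the extract_description() helper are invented for this example and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: return the content of the
# <meta name="description"> tag, or undef when the page has none.
sub extract_description {
    my ($html) = @_;
    my ($description) = $html =~ m{
        <meta \s+ name="description" \s+ content=" ( [^"]* ) "
    }msx;
    return $description;
}

# Invented sample input, for illustration only.
my $page = <<'HTML';
<html><head>
<meta name="description" content="Amazon.fr : Un titre : Livres by Un Auteur">
</head></html>
HTML

my $found = extract_description($page);
print defined $found ? "$found\n" : "no description found\n";
```

A pattern like this is short and has no extra dependency, but it is tied to the exact markup of the page and stops matching as soon as that markup changes.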

METHODS


search()

Loads the Amazon (FR) home page, fills in the search form with the ISBN, and submits it to the Amazon (FR) server.

The returned page should be the correct catalog page for that ISBN. If not, the function returns zero and lets the next driver in the chain have a go. If a valid page is returned, the following fields are returned via the book hash:

  isbn
  author
  title
  book_link
  thumb_link
  image_link
  pubdate
  publisher

The book_link, thumb_link and image_link refer back to the Amazon (FR) website.
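A minimal sketch of how the book hash might be consumed, assuming the driver is used through WWW::Scraper::ISBN as described in the parent class documentation. The lookup() helper is hypothetical; the module is loaded at run time so that the script still compiles where the distribution is not installed, and any field may be undefined if the page could not be parsed.

```perl
use strict;
use warnings;

# Hypothetical helper: look up one ISBN through the AmazonFR driver and
# print a few fields of the book hash.
sub lookup {
    my ($isbn) = @_;

    # Loaded at run time; requires the WWW::Scraper::ISBN distribution
    # and network access to amazon.fr.
    require WWW::Scraper::ISBN;

    my $scraper = WWW::Scraper::ISBN->new();
    $scraper->drivers('AmazonFR');

    my $record = $scraper->search($isbn);
    unless ($record->found) {
        print "Not found: ", ($record->error || 'unknown error'), "\n";
        return;
    }

    my $book = $record->book;
    for my $field (qw(isbn title author publisher pubdate book_link)) {
        my $value = defined $book->{$field} ? $book->{$field} : '(unknown)';
        print "$field: $value\n";
    }
    return $book;
}

# Only runs a search when an ISBN is given on the command line.
lookup($ARGV[0]) if @ARGV;
```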

DIAGNOSTICS


search() reports errors through the handler() method of the driver object. The possible cases are:

Amazon.fr cannot be reached
  Error loading amazon.fr form web page (unreachable?)

Wrong web page, or a possible change in Amazon's page design
  Error parsing amazon.fr form

Lost connection to Amazon, or a possible change in Amazon's page design
  Error about form submission (form changed?)

Error while parsing Amazon's answer (my mistake?)
  Could not extract data from amazon.fr result page.
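The messages above can be turned into hints for the caller. The sketch below assumes the error string reaches the caller unchanged or as part of a longer string (for example through the error() method of the record), which is why it matches on substrings. The explain_error() helper and the wording of the hints are this example's own, not part of the module.

```perl
use strict;
use warnings;

# Documented messages paired with a short hint. The hints are invented
# for this sketch; only the message prefixes come from the module.
my @KNOWN_ERRORS = (
    [ 'Error loading amazon.fr form web page' => 'network problem: retry later' ],
    [ 'Error parsing amazon.fr form'          => 'page layout changed: driver needs an update' ],
    [ 'Error about form submission'           => 'connection lost or form changed' ],
    [ 'Could not extract data'                => 'result page not understood by the driver' ],
);

# Hypothetical helper: map an error string to one of the hints above.
sub explain_error {
    my ($message) = @_;
    return 'no error' unless defined $message && length $message;
    for my $entry (@KNOWN_ERRORS) {
        my ($prefix, $hint) = @$entry;
        return $hint if index($message, $prefix) >= 0;
    }
    return 'unrecognised error';
}

print explain_error('Error parsing amazon.fr form'), "\n";
```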

BUGS and LIMITATIONS


From time to time, the following message can appear on STDERR (up to 2 times per request?):

    Malformed UTF-8 character (unexpected end of string)
    in subroutine entry at
    (/some/path/to/the/module)/HTML/PullParser.pm line 83

This does not prevent search() from completing its job, and it does not seem to be deterministic.

The calls $mechanize->get( SEARCH ) (1 message) and $mechanize->submit() (2 messages) in search() seem to be responsible for this. So I am tempted to blame Amazon, but I have not checked.
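If the message is a nuisance, one possible workaround is to filter it in the calling code with a local warning handler. This is only a sketch: the without_utf8_noise() helper is hypothetical, it assumes the message is emitted as an ordinary Perl warning, and it hides the symptom without addressing the cause.

```perl
use strict;
use warnings;

# Hypothetical helper: run a code block while dropping only the
# "Malformed UTF-8 character" warnings; all other warnings are re-emitted.
sub without_utf8_noise {
    my ($code) = @_;
    local $SIG{__WARN__} = sub {
        my ($warning) = @_;
        return if $warning =~ /^Malformed UTF-8 character/;
        warn $warning;    # the hook is disabled while it runs, so no recursion
    };
    return $code->();
}

# Stand-in for a call to search(): emits the noisy warning, then returns.
my $result = without_utf8_noise(sub {
    warn "Malformed UTF-8 character (unexpected end of string)\n";    # dropped
    return 'done';
});
print "$result\n";
```

In real use the code block would wrap the call that triggers the scraper, so that the filter is only active for the duration of the search.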

REQUIRES


Requires the following modules be installed:

WWW::Scraper::ISBN::Driver
WWW::Mechanize

SEE ALSO


WWW::Scraper::ISBN
WWW::Scraper::ISBN::Record
WWW::Scraper::ISBN::Driver

AUTHOR


Fabien GALAND, <galand@cpan.org>

CREDIT


This module is a copy and translation of WWW::Scraper::ISBN::AmazonUS_Driver, written by Barbie, <barbie@cpan.org>.

COPYRIGHT



package WWW::Scraper::ISBN::AmazonFR_Driver;

use strict;
use warnings;

use vars qw($VERSION);
$VERSION = '0.02';

#--------------------------------------------------------------------------

#--------------------------------------------------------------------------

###########################################################################
#Inheritance                                                              #
###########################################################################

use base qw(WWW::Scraper::ISBN::Driver);

###########################################################################
#Library Modules                                                          #
###########################################################################

use WWW::Mechanize;

###########################################################################
#Constants                                                                #
###########################################################################

use constant	AMAZON	=> 'http://www.amazon.fr/';
use constant	SEARCH	=> 'http://www.amazon.fr/';

#--------------------------------------------------------------------------

###########################################################################
#Interface Functions                                                      #
###########################################################################

sub search {
	my $self = shift;
	my $isbn = shift;
	$self->found(0);
	$self->book(undef);

	my $mechanize = WWW::Mechanize->new();
	$mechanize->get( SEARCH );
	return	$self->handler('Error loading amazon.fr form web page (unreachable?)')
	    unless($mechanize->success());


	my ($index,$input) = (0,0);

	$mechanize->form_name('site-search')
	    or return $self->handler('Error parsing amazon.fr form');

	# Restrict the search to books
	# (<select name="url"><option name="url" value="">...
	my $keyword = 'search-alias=stripbooks';
	$mechanize->set_fields(
		'field-keywords'	=> $isbn,
		'url'			=> $keyword,
	);
	$mechanize->submit();


	return	$self->handler('Error about form submission (form changed?)') 
	    unless($mechanize->success());


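	my $content = $mechanize->content();
	my ($con, $thumb, $image, $pub);

	# Each s{}{} below deletes the text it matched from $content.

	# Page description (parsed later into title and author).
	if($content =~ s{
			.*
			<meta \s name="description" \s content=" ( [^"]* ) "	.*
			<div \s class="buying">					.*
			<script \s language=					.*
			function \s registerImage
		}{}msx)
	{
		$con = $1;
	}

	# URL passed to registerImage("original_image", ...), kept as thumb_link.
	if($content =~ s{
			<script>						.*
			registerImage\("original_image",
			\s " ( [^"]* ) ",
		}{}msx)
	{
		$thumb = $1;
	}

	# Link found before the product description, kept as image_link.
	if($content =~ s{
			\s "<a \s href="\+'"'\+" ( [^"]* ) "\+			.*
			<b \s class="h1">Description \s du \s produit</b><br\s />
		}{}msx)
	{
		$image = $1;
	}

	# Publisher line ("Editeur"), parsed later into publisher and pubdate.
	if($content =~ s{
			<li><b>Editeur \s :</b> ( (?: [^\n](?!</li>) )* )
		}{}msx)
	{
		$pub = $1;
	}

	my $data = {};
	$data->{content}    = $con;
	$data->{thumb_link} = $thumb;
	$data->{image_link} = $image;
	$data->{published}  = $pub;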

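	# $data is always a defined hash reference, so test the extracted
	# description instead: without it, title and author cannot be parsed.
	return $self->handler("Could not extract data from amazon.fr result page.")
		unless(defined $data->{content});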

	# trim top and tail
	foreach (keys %$data) { 
            next unless defined $data->{$_};
            $data->{$_} =~ s/^\s+//;
            $data->{$_} =~ s/\s+$//;
        }

	($data->{title},$data->{author}) = 
		($data->{content} =~ 
                  /Amazon.fr\s*:\s*
                                    (.*)
                                    :\s*Livres?.*
                                    by\s+(.*)/x);


	($data->{publisher},$data->{pubdate}) = 
		($data->{published} =~ /\s*(.*?)(?:;.*?)?\s+\(([^)]*)/);

	my $bk = {
		'isbn'			=> $isbn,
		'author'		=> $data->{author},
		'title'			=> $data->{title},
		'image_link'	        => $data->{image_link},
		'thumb_link'	        => $data->{thumb_link},
		'publisher'		=> $data->{publisher},
		'pubdate'		=> $data->{pubdate},
		'book_link'		=> $mechanize->uri()
	};
	$self->book($bk);
	$self->found(1);
	return $self->book;
}

1;
__END__