| URI-Fetch documentation | Contained in the URI-Fetch distribution. |
URI::Fetch - Smart URI fetching/caching
    use URI::Fetch;

    ## Simple fetch.
    my $res = URI::Fetch->fetch('http://example.com/atom.xml')
        or die URI::Fetch->errstr;

    ## Fetch using specified ETag and Last-Modified headers.
    $res = URI::Fetch->fetch('http://example.com/atom.xml',
        ETag         => '123-ABC',
        LastModified => time - 3600,
    )
        or die URI::Fetch->errstr;

    ## Fetch using an on-disk cache that URI::Fetch manages for you.
    use Cache::File;
    my $cache = Cache::File->new( cache_root => '/tmp/cache' );
    $res = URI::Fetch->fetch('http://example.com/atom.xml',
        Cache => $cache
    )
        or die URI::Fetch->errstr;
URI::Fetch is a smart client for fetching HTTP pages, notably syndication feeds (RSS, Atom, and others), in an intelligent, bandwidth- and time-saving way. That means:
If you have Compress::Zlib installed, URI::Fetch will automatically try to download a compressed version of the content, saving bandwidth (and time).
If you use a local cache (see the Cache parameter to fetch), URI::Fetch will keep track of the Last-Modified and ETag headers from the server, allowing you to only download pages that have been modified since the last time you checked.
Certain HTTP error codes are special, particularly when fetching syndication feeds, and well-written clients should pay special attention to them. URI::Fetch can only do so much for you in this regard, but it gives you the tools to be a well-written client.
The response from fetch gives you the raw HTTP response code, along with special handling of four codes:
URI_OK (200): Signals that the content of a page/feed was retrieved successfully.
URI_MOVED_PERMANENTLY (301): Signals that a page/feed has moved permanently, and that your database of feeds should be updated to reflect the new URI.
URI_NOT_MODIFIED (304): Signals that a page/feed has not changed since it was last fetched.
URI_GONE (410): Signals that a page/feed is gone and will never be coming back, so you should stop trying to fetch it.
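One way a feed client might act on these four codes is sketched below. The numeric values mirror the constants in the module source (URI_OK is 200, URI_MOVED_PERMANENTLY is 301, URI_NOT_MODIFIED is 304, URI_GONE is 410); the sub name next_action and the action labels it returns are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Local copies of the URI::Fetch status constants, so the sketch runs
# without the module installed.
use constant {
    URI_OK                => 200,
    URI_MOVED_PERMANENTLY => 301,
    URI_NOT_MODIFIED      => 304,
    URI_GONE              => 410,
};

# Hypothetical helper: map a status code to what a feed client should do.
sub next_action {
    my ($status) = @_;
    return 'unknown'      unless defined $status;
    return 'parse'        if $status == URI_OK;
    return 'update_uri'   if $status == URI_MOVED_PERMANENTLY;
    return 'keep_current' if $status == URI_NOT_MODIFIED;
    return 'unsubscribe'  if $status == URI_GONE;
    return 'unknown';
}

print next_action(URI_GONE), "\n";   # prints "unsubscribe"
```

With a real response object, the argument would be $res->status.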
URI::Fetch->fetch($uri, %param)
Fetches a page identified by the URI $uri.
On success, returns a URI::Fetch::Response object; on failure, returns undef.
%param can contain:
LastModified and ETag can be supplied to force the server to only return the full page if it's changed since the last request. If you're writing your own feed client, this is recommended practice, because it limits both your bandwidth use and the server's.
If you'd rather not have to store the LastModified time and ETag yourself, see the Cache parameter below (and the SYNOPSIS above).
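If you do track the validators yourself, the bookkeeping can be as small as a hash keyed by URI. The following is only a sketch: %seen, remember, and conditional_params are hypothetical names, and the fetch call appears in a comment so that the block runs without network access.

```perl
use strict;
use warnings;

# Hypothetical store of validators from earlier responses, keyed by URI.
my %seen;

# Record the validators from a response; $etag and $last_modified would
# come from $res->etag and $res->last_modified.
sub remember {
    my ($uri, $etag, $last_modified) = @_;
    $seen{$uri} = { ETag => $etag, LastModified => $last_modified };
    return;
}

# Build the extra fetch() parameters for a URI, skipping unknown values.
sub conditional_params {
    my ($uri) = @_;
    my $v = $seen{$uri} or return ();
    return map { ($_ => $v->{$_}) }
           grep { defined $v->{$_} } qw( ETag LastModified );
}

remember('http://example.com/atom.xml', '123-ABC', 1_100_000_000);
my %param = conditional_params('http://example.com/atom.xml');

# Intended use (not executed here):
#   my $res = URI::Fetch->fetch($uri, conditional_params($uri));
```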
If you'd like URI::Fetch to cache responses between requests, provide the Cache parameter with an object supporting the Cache API (e.g. Cache::File, Cache::Memory). Specifically, the object must support $cache->get($key) and $cache->set($key, $value, $expires).
If supplied, URI::Fetch will store the page content, ETag, and last-modified time of the response in the cache, and will pull the content from the cache on subsequent requests if the server returns a Not-Modified (304) response.
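For testing, or where an on-disk cache is unnecessary, a tiny in-memory object with those two methods may be enough. The class below is a hypothetical sketch (My::HashCache is not part of any distribution), and treating $expires as a lifetime in seconds is an assumption of this sketch rather than a statement about the Cache API.

```perl
use strict;
use warnings;

# Hypothetical in-memory cache exposing the get/set methods named above.
package My::HashCache;

sub new { my ($class) = @_; return bless { store => {} }, $class }

# Return the stored value, or undef if the key is missing or expired.
sub get {
    my ($self, $key) = @_;
    my $entry = $self->{store}{$key} or return undef;
    if (defined $entry->{expires} && $entry->{expires} <= time) {
        delete $self->{store}{$key};
        return undef;
    }
    return $entry->{value};
}

# Store a value; $expires is an optional lifetime in seconds.
sub set {
    my ($self, $key, $value, $expires) = @_;
    $self->{store}{$key} = {
        value   => $value,
        expires => defined $expires ? time + $expires : undef,
    };
    return 1;
}

package main;

my $cache = My::HashCache->new;
$cache->set('http://example.com/atom.xml', 'frozen blob');
print $cache->get('http://example.com/atom.xml'), "\n";   # prints "frozen blob"
```

An instance could then be passed to fetch as Cache => $cache. The module source calls set with only a key and a value, so the expiry argument would go unused there.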
UserAgent: Optional. You may provide your own LWP::UserAgent instance. Look into LWPx::ParanoidUserAgent if you're fetching URLs given to you by possibly malicious parties.
NoNetwork: Optional. Controls the interaction between the cache and HTTP requests with If-Modified-Since/If-None-Match headers. Possible behaviors are:
0 (default): If a page is in the cache, the origin HTTP server is always checked for a fresher copy with an If-Modified-Since and/or If-None-Match header.
1: The origin HTTP server is never contacted, regardless of whether the page is in the cache. If the page is missing from the cache, the fetch method returns undef. If the page is in the cache, that page is returned, no matter how old it is. Note that setting this option means the URI::Fetch::Response object will never have the http_response member set.
N, where N > 1: The origin HTTP server is not contacted if the page is in the cache and the cached page was inserted in the last N seconds. If the cached copy is older than N seconds, a normal HTTP request (full or cache check) is done.
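The three behaviors reduce to one small decision, sketched here as a standalone helper. The comparison follows the one in the module source (cache time greater than the current time minus N); the name serve_from_cache and its argument order are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a cached copy may be served without
# contacting the server. $cache_time is when the entry was stored (epoch
# seconds), or undef if the page is not cached.
sub serve_from_cache {
    my ($no_network, $cache_time, $now) = @_;
    return 0 unless $no_network;            # 0: always ask the server
    return 0 unless defined $cache_time;    # nothing cached to serve
    return 1 if $no_network == 1;           # 1: any cached copy will do
    return $cache_time > $now - $no_network ? 1 : 0;   # N: fresh enough?
}

my $now = 1_000_000;
print serve_from_cache(0,   $now - 10,  $now), "\n";   # prints 0
print serve_from_cache(1,   $now - 999, $now), "\n";   # prints 1
print serve_from_cache(300, $now - 10,  $now), "\n";   # prints 1
print serve_from_cache(300, $now - 600, $now), "\n";   # prints 0
```

Note that the helper only answers whether the cache can be used. With a value of 1 and no cached copy, fetch returns undef instead of falling back to the network.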
ContentAlterHook: Optional. A subref that gets called with a scalar reference to your content so you can modify the content before it's returned and before it's put in cache.
For instance, you may want to only cache the <head> section of an HTML document, or you may want to take a feed URL and cache only a pre-parsed version of it. If you modify the scalarref given to your hook and change it into a hashref, scalarref, or some blessed object, that same value will be returned to you later on not-modified responses.
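As a rough illustration of the first case, the hook below trims an HTML page down to its head element. It is a sketch only: the variable name $head_only is hypothetical, the regular expression is a simplification that assumes reasonably well-formed markup, and the fetch call is shown in a comment so that the block runs without network access.

```perl
use strict;
use warnings;

# Hypothetical hook: keep only the <head> element of an HTML page.
# The hook receives a reference to the content and edits it in place.
my $head_only = sub {
    my ($content_ref) = @_;
    if ($$content_ref =~ m{(<head\b.*?</head>)}is) {
        $$content_ref = $1;
    }
    return;
};

my $html = "<html><head><title>Feed</title></head><body>Long body</body></html>";
$head_only->(\$html);
print "$html\n";   # prints "<head><title>Feed</title></head>"

# Intended use (not executed here):
#   URI::Fetch->fetch($uri, Cache => $cache, ContentAlterHook => $head_only);
```

Content without a head element is left untouched by this particular hook.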
CacheEntryGrep: Optional. A subref that gets called with the URI::Fetch::Response object about to be cached (with the contents already possibly transformed by your ContentAlterHook). If your subref returns true, the page goes into the cache. If false, it doesn't.
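A filter of this kind might, for example, refuse to cache empty or very large bodies. The sketch below assumes the response object offers a content accessor, as the module source suggests; $small_only, the size limit, and the My::StubResponse stand-in class are all hypothetical and exist only so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical filter: cache only non-empty bodies of at most 1,000,000 bytes.
my $max_bytes = 1_000_000;   # illustrative limit, not a module default
my $small_only = sub {
    my ($response) = @_;
    my $content = $response->content;
    return 0 unless defined $content && !ref $content;
    my $len = length $content;
    return ($len > 0 && $len <= $max_bytes) ? 1 : 0;
};

# Stand-in for URI::Fetch::Response so the sketch runs on its own.
package My::StubResponse;
sub new     { my ($class, %args) = @_; my $self = { %args }; return bless $self, $class }
sub content { $_[0]{content} }

package main;

print $small_only->(My::StubResponse->new(content => 'x' x 10)), "\n";   # prints 1
print $small_only->(My::StubResponse->new(content => '')), "\n";         # prints 0

# Intended use (not executed here):
#   URI::Fetch->fetch($uri, Cache => $cache, CacheEntryGrep => $small_only);
```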
Freeze, Thaw: Optional. Subrefs that get called to serialize and deserialize, respectively, the data that will be cached. The cached data should be assumed to be an arbitrary Perl data structure, containing (potentially) references to arrays, hashes, etc.
Freeze should serialize the structure into a scalar; Thaw should deserialize the scalar into a data structure.
By default, Storable will be used for freezing and thawing the cached data structure.
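As an illustration, a pair of subrefs that stores the cache entry as JSON text could look like the sketch below. It assumes the cached structure is a plain hash of scalars (the keys shown are the ones the module source writes); content that a ContentAlterHook has turned into a blessed object would not survive this encoding. The variable names are hypothetical.

```perl
use strict;
use warnings;
use JSON::PP ();

# canonical() sorts keys so the same data always yields the same string.
my $json = JSON::PP->new->utf8->canonical;

# Hypothetical Freeze/Thaw pair storing the cache entry as JSON text.
my $freeze = sub { my ($data)   = @_; return $json->encode($data) };
my $thaw   = sub { my ($string) = @_; return $json->decode($string) };

my $entry = {
    ETag         => '123-ABC',
    LastModified => 1_100_000_000,
    Content      => '<feed/>',
    CacheTime    => 1_100_003_600,
    ContentType  => 'application/atom+xml',
};
my $copy = $thaw->($freeze->($entry));
print $copy->{ETag}, "\n";   # prints "123-ABC"

# Intended use (not executed here):
#   URI::Fetch->fetch($uri, Cache => $cache, Freeze => $freeze, Thaw => $thaw);
```

Going by the module source, both subrefs must be supplied together; if either is missing, the Storable defaults are used for both.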
ForceResponse: Optional. A boolean that indicates a URI::Fetch::Response should be returned regardless of the HTTP status. By default, undef is returned when a response is not a "success" (2xx codes) or one of the recognized HTTP status codes listed above. When undef is returned, the HTTP status message can be retrieved using the errstr method on the class.
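With ForceResponse set, the caller has to tell a recognized result from an arbitrary HTTP failure. The sketch below assumes, based on the module source, that http_status always carries the raw HTTP code while status is assigned only for the recognized codes; the helper describe and the My::FakeResponse stand-in class are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise the outcome of a fetch made with
# ForceResponse => 1.
sub describe {
    my ($res) = @_;
    return 'no response' unless defined $res;
    return 'handled: ' . $res->status if defined $res->status;
    return 'unhandled HTTP ' . $res->http_status;
}

# Stand-in for URI::Fetch::Response so the sketch runs on its own.
package My::FakeResponse;
sub new         { my ($class, %args) = @_; my $self = { %args }; return bless $self, $class }
sub status      { $_[0]{status} }
sub http_status { $_[0]{http_status} }

package main;

print describe(My::FakeResponse->new(http_status => 500)), "\n";                 # prints "unhandled HTTP 500"
print describe(My::FakeResponse->new(http_status => 200, status => 200)), "\n";  # prints "handled: 200"

# Intended use (not executed here):
#   my $res = URI::Fetch->fetch($uri, ForceResponse => 1);
#   print describe($res), "\n";
```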
URI::Fetch is free software; you may redistribute it and/or modify it under the same terms as Perl itself.
Except where otherwise noted, URI::Fetch is Copyright 2004 Benjamin Trott, ben+cpan@stupidfool.org. All rights reserved.
package URI::Fetch;
use strict;
use 5.008_001;
use base qw( Class::ErrorHandler );

use LWP::UserAgent;
use Carp qw( croak );
use URI;
use URI::Fetch::Response;

our $VERSION = '0.09';

our $HAS_ZLIB;
BEGIN {
    $HAS_ZLIB = eval "use Compress::Zlib (); 1;";
}

use constant URI_OK                => 200;
use constant URI_MOVED_PERMANENTLY => 301;
use constant URI_NOT_MODIFIED      => 304;
use constant URI_GONE              => 410;

sub fetch {
    my $class = shift;
    my($uri, %param) = @_;

    # get user parameters
    my $cache        = delete $param{Cache};
    my $ua           = delete $param{UserAgent};
    my $p_etag       = delete $param{ETag};
    my $p_lastmod    = delete $param{LastModified};
    my $content_hook = delete $param{ContentAlterHook};
    my $p_no_net     = delete $param{NoNetwork};
    my $p_cache_grep = delete $param{CacheEntryGrep};
    my $freeze       = delete $param{Freeze};
    my $thaw         = delete $param{Thaw};
    my $force        = delete $param{ForceResponse};
    croak("Unknown parameters: " . join(", ", keys %param))
        if %param;

    my $ref;
    if ($cache) {
        unless ($freeze && $thaw) {
            require Storable;
            $thaw   = \&Storable::thaw;
            $freeze = \&Storable::freeze;
        }
        if (my $blob = $cache->get($uri)) {
            $ref = $thaw->($blob);
        }
    }

    # NoNetwork support (see pod docs below for logic clarification)
    if ($p_no_net) {
        croak("Invalid NoNetworkValue (negative)") if $p_no_net < 0;
        if ($ref && ($p_no_net == 1 || $ref->{CacheTime} > time() - $p_no_net)) {
            my $fetch = URI::Fetch::Response->new;
            $fetch->status(URI_OK);
            $fetch->content($ref->{Content});
            $fetch->etag($ref->{ETag});
            $fetch->last_modified($ref->{LastModified});
            $fetch->content_type($ref->{ContentType});
            return $fetch;
        }
        return undef if $p_no_net == 1;
    }

    $ua ||= do {
        my $ua = LWP::UserAgent->new;
        $ua->agent(join '/', $class, $class->VERSION);
        $ua->env_proxy;
        $ua;
    };

    my $req = HTTP::Request->new(GET => $uri);
    if ($HAS_ZLIB) {
        $req->header('Accept-Encoding', 'gzip');
    }
    if (my $etag = ($p_etag || $ref->{ETag})) {
        $req->header('If-None-Match', $etag);
    }
    if (my $ts = ($p_lastmod || $ref->{LastModified})) {
        $req->if_modified_since($ts);
    }

    my $res = $ua->request($req);
    my $fetch = URI::Fetch::Response->new;
    $fetch->uri($uri);
    $fetch->http_status($res->code);
    $fetch->http_response($res);
    $fetch->content_type($res->header('Content-Type'));
    if ($res->previous && $res->previous->code == HTTP::Status::RC_MOVED_PERMANENTLY()) {
        $fetch->status(URI_MOVED_PERMANENTLY);
        $fetch->uri($res->previous->header('Location'));
    } elsif ($res->code == HTTP::Status::RC_GONE()) {
        $fetch->status(URI_GONE);
        $fetch->uri(undef);
        return $fetch;
    } elsif ($res->code == HTTP::Status::RC_NOT_MODIFIED()) {
        $fetch->status(URI_NOT_MODIFIED);
        $fetch->content($ref->{Content});
        $fetch->etag($ref->{ETag});
        $fetch->last_modified($ref->{LastModified});
        $fetch->content_type($ref->{ContentType});
        return $fetch;
    } elsif (!$res->is_success) {
        return $force ? $fetch : $class->error($res->message);
    } else {
        $fetch->status(URI_OK);
    }
    $fetch->last_modified($res->last_modified);
    $fetch->etag($res->header('ETag'));
    my $content = $res->content;
    if ($res->content_encoding && $res->content_encoding eq 'gzip') {
        $content = Compress::Zlib::memGunzip($content);
    }

    # let caller-defined transform hook modify the result that'll be
    # cached. perhaps the caller only wants the <head> section of
    # HTML, or wants to change the content to a parsed datastructure
    # already serialized with Storable.
    if ($content_hook) {
        croak("ContentAlterHook is not a subref") unless ref $content_hook eq "CODE";
        $content_hook->(\$content);
    }

    $fetch->content($content);

    # cache by default, if there's a cache. but let callers cancel
    # the cache action by defining a cache grep hook
    if ($cache &&
        ($p_cache_grep ? $p_cache_grep->($fetch) : 1)) {
        $cache->set($fetch->uri, $freeze->({
            ETag         => $fetch->etag,
            LastModified => $fetch->last_modified,
            Content      => $fetch->content,
            CacheTime    => time(),
            ContentType  => $fetch->content_type,
        }));
    }
    $fetch;
}

1;
__END__