| Pod-Extract-URI documentation | Contained in the Pod-Extract-URI distribution. |
Pod::Extract::URI - Extract URIs from POD
use Pod::Extract::URI; # Get a list of URIs from a file my @uris = Pod::Extract::URI->uris_from_file( $file ); # Or filehandle my @uris = Pod::Extract::URI->uris_from_filehandle( $filehandle ); # Or the full OO my $parser = Pod::Extract::URI->new(); $parser->parse_from_file( $file ); my @uris = $parser->uris(); my %uri_details = $parser->uri_details();
This module parses POD and uses URI::Find or URI::Find::Schemeless
to extract any URIs it can.
Create a new Pod::Extract::URI object.
new() takes an optional hash of options, whose names correspond to
object methods described in more detail below.
Should the parser try to extract schemeless URIs (using URI::Find::Schemeless)?
Should the parser only look for URIs in L<> sequences?
Should the parser look in POD text paragraph, verbatim blocks, or commands?
Restrict URIs to the schemes in the arrayref.
Exclude URIs with the schemes in the arrayref.
An arrayref of patterns to ignore.
A reference to a subroutine to run for each URI to see if the URI should be ignored.
Convert the URIs found to their canonical form.
Strip extra brackets which may appear around the URL returned by URI::Find. See method below for more details.
Get/set the L_only flag. Takes one optional true/false argument to set the L_only flag. Defaults to false.
If true, Pod::Extract::URI will look for URIs only in L<>
sequences, otherwise it will look anywhere in the POD.
Get/set the want_command flag. Takes one optional true/false argument to set the want_command flag. Defaults to true.
If true, Pod::Extract::URI will look for URIs in command blocks (i.e.
=head1, etc.).
Get/set the want_textblock flag. Takes one optional true/false argument to set the want_textblock flag. Defaults to true.
If true, Pod::Extract::URI will look for URIs in textblocks (i.e.
paragraphs).
Get/set the want_verbatim flag. Takes one optional true/false argument to set the want_verbatim flag. Defaults to true.
If true, Pod::Extract::URI will look for URIs in verbatim blocks (i.e.
code examples, etc.).
$peu->schemes( [ 'http', 'ftp' ] );
Get/set the list of schemes to search for. Takes an optional arrayref of schemes to set.
If there are no schemes, Pod::Extract::URI will look for all schemes.
$peu->exclude_schemes( [ 'mailto', 'https' ] );
Get/set the list of schemes to ignore. Takes an optional arrayref of schemes to set.
$peu->stop_uris( [
qr/example\.com/,
'foobar.com'
] );
Get/set a list of patterns to apply to each URI to see if it should be ignored. Takes an optional arrayref of patterns to set. Strings in the list will be automatically converted to patterns (using qr//).
The URIs will be checked against the canonical URI form if use_canonical
has been specified. Otherwise, they will be checked against the URI as it
appears in the POD. If strip_brackets is specified, the brackets (and
"URL:" prefix, if present) will be removed before testing.
Any URI that matches a pattern will be ignored.
sub exclude {
my $uri = shift;
return ( $uri->host =~ /example\.com/ ) ? 1 : 0;
}
$peu->stop_sub( \&exclude );
Get/set a subroutine to check each URI found to see if it should be ignored. Takes an optional coderef to set.
The subroutine will be passed a reference to the URI object, the text found
by URI::Find, and a reference to the Pod::Extract::URI object. If it
returns true, the URI will be ignored.
Get/set the use_canonical flag. Takes one optional true/false argument to set the use_canonical flag. Defaults to false.
If true, Pod::Extract::URI will store the URIs it finds in the canonical
form (as returned by URI-canonical()>. The original URI and text will
still be available via uri_details().
Get/set the strip_brackets flag. Takes one optional true/false argument to set the strip_brackets flag. Defaults to true.
RFC 2396 Appendix E suggests the form <http://www.example.com/>
or <URL:http://www.example.com/> when embedding URLs in plain text.
URI::Find includes these in the URLs it returns. If strip_brackets is
true, this extra stuff will be removed and won't appear in the URIs returned
by Pod::Extract::URI.
$peu->parse_from_file( $filename );
Parses the POD from the specified file and stores the URIs it finds for later retrieval.
$peu->parse_from_filehandle( $filehandle );
Parses the POD from the filehandle and stores the URIs it finds for later retrieval.
my @uris = $peu->uris_from_file( $filename );
A shortcut for parse_from_file() then uris().
my @uris = $peu->uris_from_filehandle( $filename );
A shortcut for parse_from_filehandle() then uris().
my @uris = $peu->uris();
Returns a list of the URIs found from parsing.
my %details = $peu->uri_details();
Returns a hash of data about the URIs found.
The keys of the hash are the URIs (which match those returned by uris()).
The values of the hash are arrayrefs of hashrefs. Each hashref contains
The URI object returned by URI::Find.
The text returned by URI::Find, which will have the brackets stripped
from it if strip_brackets has been specified.
The original text returned by URI::Find.
The initial line number of the paragraph in which the URI was found.
The Pod::Paragraph object corresponding to the paragraph where the URI
was found.
You can specify URIs to ignore in your POD, using a =for stop_uris
command, e.g.
=for stop_uris www.foobar.com
These will be converted to patterns as if they had been passed in via
stop_uris() directly, and will apply from the point of the command
onwards.
Ian Malpass (ian-cpan@indecorous.com)
Copyright 2007, Ian Malpass
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
| Pod-Extract-URI documentation | Contained in the Pod-Extract-URI distribution. |
package Pod::Extract::URI; use strict; use warnings; use Carp; use URI::Find; use URI::Find::Schemeless; use Pod::Escapes; use base qw(Pod::Parser); our $VERSION = '0.3';
sub new { my $proto = shift; my $class = ref $proto || $proto; my %args = @_; # default arguments my %my_args = ( schemeless => 0, L_only => 0, want_textblock => 1, want_verbatim => 1, want_command => 1, schemes => [], exclude_schemes => [], stop_uris => [], stop_sub => sub { return 0 }, use_canonical => 0, strip_brackets => 1, ); # override defaults for my $arg ( keys %my_args ) { if ( exists $args{ $arg } ) { $my_args{ $arg } = $args{ $arg }; # remove arguments - anything left will be passed # to Pod::Parser delete $args{ $arg }; } } # instantiate Pod::Parser object # pass any leftover arguments my $self = $class->SUPER::new( %args ); $self->{ URIS } = {}; # URI details $self->{ URI_LIST } = []; # ordered URI list my $find_class = "URI::Find"; if ( $my_args{ schemeless } ) { $find_class = "URI::Find::Schemeless"; } delete $my_args{ schemeless }; # no schemeless() method # instantiate finder object with callback closure my $finder = $find_class->new( sub { $self->_register_uri( @_ ); } ); $self->_finder( $finder ); # call methods for remaining arguments for my $arg ( keys %my_args ) { $self->$arg( $my_args{ $arg } ); } return $self; } # process # Use the URI::Find object to find URIs. The URI::Find object has a callback # which will record any URIs it finds sub _process { my ( $self, $text ) = @_; $self->_finder->find( \$text ); } # textblock # Overrides Pod::Parser method, handling POD textblock events sub textblock { my ( $self, $text, $line, $para ) = @_; $self->_current_line( $line, $para ); # stash current line info for callback if ( $self->want_textblock() ) { # interpolate to get interior sequence expansion $text = $self->interpolate( $text, $line ); if ( ! $self->L_only ) { # interpolate() will sort out extraction for L<> if L_only is true $self->_process( $text, $line ); } } } # verbatim # Overrides Pod::Parser method, handling POD verbatim events sub verbatim { my ( $self, $text, $line, $para ) = @_; $self->_current_line( $line, $para ); if ( $self->want_verbatim() && ! $self->L_only() ) { # L<> not valid in verbatim blocks $self->_process( $text ); } } # command # Overrides Pod::Parser method, handling POD command events sub command { my ( $self, $cmd, $text, $line, $para ) = @_; $self->_current_line( $line, $para ); if ( $cmd eq "for" && index( $text, "stop_uris" ) == 0 ) { # We have a stop_uris directive - add them to the # list my @stop = @{ $self->stop_uris }; $text = substr( $text, 10 ); push @stop, split /\n/, $text; $self->stop_uris( \@stop ); } elsif ( $self->want_command() ) { # same logic as for textblock() $self->interpolate( $text, $line ); if ( ! $self->L_only() ) { $self->_process( $text ); } } } # interior_sequence # Overrides Pod::Parser method, handling POD interior_sequence events # Only gets called if we call interpolate() on the containing paragraph sub interior_sequence { my ( $self, $seq_cmd, $seq_arg, $pod_seq ) = @_; if ( $seq_cmd eq "L" && $self->L_only ) { # if we have an L<> sequence, process it $self->_process( $seq_arg ); } elsif ( $seq_cmd eq "E" ) { return Pod::Escapes::e2char( $seq_arg ); } return $seq_arg; } # _register_uri # Handle a URI when we find it sub _register_uri { my ( $self, $uri, $original_text ) = @_; my $text = $original_text; if ( $self->strip_brackets ) { $text =~ s/^<(URL:)?(.*)>$/$2/; } my $test_text = $text; my $uri_str = $text; if ( $self->use_canonical ) { # force to canonical form $uri = $uri->canonical; # looks like URI::Find already does this $uri_str = $uri->as_string; $test_text = $uri_str; } my $scheme = $uri->scheme(); # check the scheme and URL against the various discriminators my $include = $self->schemes; if ( scalar @$include && ! grep { $scheme eq $_ } @$include ) { return $text; } my $exclude = $self->exclude_schemes; if ( scalar @$exclude && grep { $scheme eq $_ } @$exclude ) { return $text; } my $stop = $self->stop_uris; if ( scalar @$stop && grep { $test_text =~ $_ } @$stop ) { return $text; } if ( $self->_check_stop_sub( $uri, $text ) ) { return $text; } my ( $line, $para ) = $self->_current_line(); if ( ! exists $self->{ URIS }->{ $uri_str } ) { $self->{ URIS }->{ $uri_str } = []; } push @{ $self->{ URIS }->{ $uri_str } }, { uri => $uri, text => $text, original_text => $original_text, line => $line, para => $para, }; push @{ $self->{ URI_LIST } }, $uri_str; return $text; } # _current_line # Store the current line and Pod::Paragraph object, as passed to the # Pod::Parser methods, so that _register_uri() can store them if # necessary. # Returns the current line in scalar context, and the current line and # Pod::Paragraph object in list context. sub _current_line { my ( $self, $line, $para ) = @_; if ( defined $line ) { $self->{ CURRENT_LINE } = $line; if ( defined $para ) { $self->{ CURRENT_PARA } = $para; } else { delete $self->{ CURRENT_PARA }; } } if ( wantarray ) { return ( $self->{ CURRENT_LINE }, $self->{ CURRENT_PARA } ); } else { return $self->{ CURRENT_LINE }; } } # _finder # Get/set the URI finder object sub _finder { my ( $self, $finder ) = @_; if ( defined $finder ) { $self->{ FINDER } = $finder; } return $self->{ FINDER }; }
sub L_only { my ( $self, $l_only ) = @_; if ( defined $l_only ) { $self->{ L_ONLY } = $l_only; } return $self->{ L_ONLY }; }
sub want_command { my ( $self, $command ) = @_; if ( defined $command ) { $self->{ WANT_COMMAND } = $command; } return $self->{ WANT_COMMAND }; }
sub want_textblock { my ( $self, $textblock ) = @_; if ( defined $textblock ) { $self->{ WANT_TEXTBLOCK } = $textblock; } return $self->{ WANT_TEXTBLOCK }; }
sub want_verbatim { my ( $self, $verbatim ) = @_; if ( defined $verbatim ) { $self->{ WANT_VERBATIM } = $verbatim; } return $self->{ WANT_VERBATIM }; }
sub schemes { my ( $self, $schemes ) = @_; if ( defined $schemes ) { if ( ref $schemes eq "ARRAY" ) { $self->{ SCHEMES } = $schemes; } else { carp "Argument to schemes() must be an arrayref"; } } return $self->{ SCHEMES }; }
sub exclude_schemes { my ( $self, $schemes ) = @_; if ( defined $schemes ) { if ( ref $schemes eq "ARRAY" ) { $self->{ EXCLUDE_SCHEMES } = $schemes; } else { carp "Argument to exclude_schemes() must be an arrayref"; } } return $self->{ EXCLUDE_SCHEMES }; }
sub stop_uris { my ( $self, $urls ) = @_; if ( defined $urls ) { if ( ref $urls eq "ARRAY" ) { my @urls = map { UNIVERSAL::isa( $_, "Regexp" ) ? $_ : qr/$_/ } @$urls; $self->{ STOP_URLS } = \@urls; } else { carp "Argument to stop_uris() must be an arrayref"; } } return $self->{ STOP_URLS }; }
sub stop_sub { my ( $self, $sub ) = @_; if ( defined $sub ) { if ( ref $sub eq "CODE" ) { $self->{ STOP_SUB } = $sub; } else { carp "Argument to stop_sub() must be a coderef"; } } return $self->{ STOP_SUB }; } # _check_stop_sub # Call the stop sub with the right arguments sub _check_stop_sub { my ( $self, $uri, $text ) = @_; my $sub = $self->{ STOP_SUB }; return &$sub( $uri, $text, $self ); }
sub use_canonical { my ( $self, $use ) = @_; if ( defined $use ) { $self->{ USE_CANONICAL } = $use; } return $self->{ USE_CANONICAL }; }
sub strip_brackets { my ( $self, $strip ) = @_; if ( defined $strip ) { $self->{ STRIP_BRACKETS } = $strip; } return $self->{ STRIP_BRACKETS }; }
sub uris_from_file { my ( $self, $file ) = @_; if ( ! ref $self ) { $self = $self->new(); } $self->parse_from_file( $file ); return $self->uris; }
sub uris_from_filehandle { my ( $self, $file ) = @_; if ( ! ref $self ) { $self = $self->new(); } $self->parse_from_filehandle( $file ); return @{ $self->{ URI_LIST } }; }
sub uris { my $self = shift; return @{ $self->{ URI_LIST } }; }
sub uri_details { my $self = shift; return %{ $self->{ URIS } }; }
1;