HTML::RobotsMETA - Parse HTML For Robots Exclusion META Markup


HTML-RobotsMETA documentation Contained in the HTML-RobotsMETA distribution.

NAME

HTML::RobotsMETA - Parse HTML For Robots Exclusion META Markup

SYNOPSIS

  use HTML::RobotsMETA;
  my $p = HTML::RobotsMETA->new;
  my $r = $p->parse_rules($html);
  if ($r->can_follow) {
    # follow links here!
  } else {
    # can't follow...
  }

DESCRIPTION

HTML::RobotsMETA is a simple HTML::Parser subclass that extracts robots exclusion information from meta tags. There's not much more to it ;)

DIRECTIVES

Currently HTML::RobotsMETA understands the following directives:

ALL
NONE
INDEX
NOINDEX
FOLLOW
NOFOLLOW
ARCHIVE
NOARCHIVE
SERVE
NOSERVE
NOIMAGEINDEX
NOIMAGECLICK
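
As a rough illustration of how these directives are picked out of a META tag's content attribute, here is a minimal sketch in plain Perl. It uses the same alternation as the module source, but extract_directives is a hypothetical helper written for this example and is not part of the HTML::RobotsMETA API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::RobotsMETA): collect the known
# directives that appear in a robots META "content" value.
sub extract_directives {
    my ($content) = @_;
    return {} unless defined $content;

    my %directives;

    # Lowercase once into a variable so that /g can track its position.
    my $lc = lc $content;

    # Optional "no" prefix on the four paired directives, the two image
    # directives, plus ALL and NONE.
    while ($lc =~ /((?:no)?(?:follow|index|archive|serve)|noimage(?:index|click)|all|none)/g) {
        $directives{$1} = 1;
    }
    return \%directives;
}

my $found = extract_directives('NOINDEX, NoFollow');
print join(', ', sort keys %$found), "\n";   # prints "nofollow, noindex"
```

Directive names are matched case-insensitively and stored in lowercase, so NOINDEX and noindex are treated alike.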

METHODS

new

Creates a new HTML::RobotsMETA parser. Takes no arguments.

parse_rules

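Parses an HTML string for robots META tags and returns an HTML::RobotsMETA::Rules object, which you can query in conditionals later. If the document contains several robots META tags, their directives are merged into a single rule set.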
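
When a document carries more than one robots META tag, the source merges the per-tag directive hashes before building the rules object. The following is a minimal sketch of that merge step in isolation; the @rules data is made up for illustration and no HTML is parsed here.

```perl
use strict;
use warnings;

# Directive hashes as they might be collected from two robots META tags
# in one document (illustrative data, not taken from a real page).
my @rules = (
    { noindex  => 1 },
    { nofollow => 1, noarchive => 1 },
);

# Flatten each hash into a key/value list. Duplicate keys simply
# overwrite each other, so the result is the union of all directives.
my %directives = map { %$_ } @rules;

print join(', ', sort keys %directives), "\n";   # prints "noarchive, nofollow, noindex"
```

Because the merge is a plain union, conflicting directives such as index and noindex would both end up in the merged set; how such conflicts are resolved is up to HTML::RobotsMETA::Rules.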

parser

Returns the HTML::Parser instance used for parsing. The instance is created on the first call and reused afterwards.

get_parser_callbacks

Returns the callback specs to be passed to the HTML::Parser constructor.

TODO

Tags that specify the crawler name (e.g. <META NAME="Googlebot">) are not handled yet.

There also might be more obscure directives that I'm not aware of.

AUTHOR

Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>

SEE ALSO

HTML::RobotsMETA::Rules, HTML::Parser

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html


# $Id: /mirror/perl/HTML-RobotsMETA/trunk/lib/HTML/RobotsMETA.pm 4223 2007-10-29T06:42:26.630870Z daisuke  $
# 
# Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
# All rights reserved.

package HTML::RobotsMETA;
use strict;
use warnings;
use HTML::Parser;
use HTML::RobotsMETA::Rules;
our $VERSION = '0.00004';

sub new
{
    my $class = shift;
    my $self  = bless {}, $class;
    return $self;
}

sub parser
{
    my $self = shift;
    return $self->{parser} ||= HTML::Parser->new(
        api_version => 3,
        $self->get_parser_callbacks
    );
}

sub get_parser_callbacks
{
    my $self = shift;
    return (
        start_h => [ sub { $self->_parse_start_h(@_) }, "tagname, attr" ]
    );
}

sub parse_rules
{
    my $self = shift;

    my @rules;
    local $self->{rules} = \@rules;

    my $parser = $self->parser();
    
    $parser->parse(@_);
    $parser->eof;

    # merge rules that were found in this document
    my %directives = (map { %$_ } @rules);
    return HTML::RobotsMETA::Rules->new(%directives);
}

sub _parse_start_h
{
    my ($self, $tag, $attr) = @_;

    return unless $tag eq 'meta';

    # the "name" attribute may contain either "robots", or user-specified
    # robot name, which is specific to a particular crawler
    # XXX - Handle the specific agent part later
    # attribute values keep their original case, so match case-insensitively
    return unless defined $attr->{name} && $attr->{name} =~ /^robots$/i;

    my %directives;
    # Allowed values
    #   FOLLOW
    #   NOFOLLOW
    #   INDEX
    #   NOINDEX
    #   ARCHIVE
    #   NOARCHIVE
    #   SERVE
    #   NOSERVE
    #   NOIMAGEINDEX
    #   NOIMAGECLICK
    #   ALL
    #   NONE
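    # a META tag without a content attribute carries no directives
    return unless defined $attr->{content};
    my $content = lc $attr->{content};
    while ($content =~ /((?:no)?(?:follow|index|archive|serve)|noimage(?:index|click)|all|none)/g) {
        $directives{$1}++;
    }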

    push @{$self->{rules}}, \%directives;
}

1;

__END__