Sitemapper Version 1.008

Description

sitemapper.pl is a simple Perl script which generates an HTML site map from a given URL. It does this by traversing the site: getting the home page, extracting links from it, getting all the pages linked, and so on.

The default sitemap generated is an HTML bulleted list. The top-level list item is the home page; the next level contains all the pages linked from the home page; the level after that contains all the pages linked from each of these pages, and so on. If a page is linked from more than one page, it is shown at the "highest" place in the tree from which it is linked.
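The placement rule above can be sketched in a few lines of Python. This is only an illustration, not the code sitemapper.pl uses: it assumes a breadth-first walk (one way to guarantee that a page lands at its shallowest level), and the `links` dictionary and the `build_site_tree` name are hypothetical stand-ins for fetching pages and extracting their links.

```python
from collections import deque

def build_site_tree(home, links):
    """Return {page: [children]}, attaching each page at its shallowest depth.

    `links` maps a page to the pages it links to; it stands in for
    downloading the page and parsing out its links.
    """
    children = {home: []}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target in children:
                # Already placed at this level or higher: keep the first placement.
                continue
            children[target] = []
            children[page].append(target)
            queue.append(target)
    return children

# Hypothetical site: /c is linked from both /a and /b, and /b links back home.
links = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/b"],
    "/b": ["/c", "/"],
}
tree = build_site_tree("/", links)
for page, kids in tree.items():
    print(page, "->", kids)
```

In this sketch /b appears only under the home page even though /a also links to it, and /c is attached to whichever of its equally deep parents is visited first.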

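Alternative sitemap formats, selected with the -format option (see Examples), are:

js - a dynamic HTML (JavaScript) tree
xml - an ad hoc XML format
text - plain text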

sitemapper.pl should correctly deal with framesets, client side image maps, and <BASE> tags. It ignores all "off site" links - i.e. all absolute URLs that do not start with the original "base" URL of the home page.
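The "off site" rule can be illustrated with a short Python sketch. It is an assumption-laden example rather than the script's actual logic: the function name `on_site_links` is hypothetical, relative links are resolved with the standard `urllib.parse.urljoin`, and dropping `#fragment` parts is a choice made here for the example, not something the description above states. For a page with a <BASE> tag, the BASE href would be passed as `page_url`.

```python
from urllib.parse import urljoin, urldefrag

def on_site_links(base_url, page_url, hrefs):
    """Resolve each href against page_url and keep those that start with base_url."""
    kept = []
    for href in hrefs:
        # Make the link absolute, then drop any #fragment (an assumption of this sketch).
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(base_url) and absolute not in kept:
            kept.append(absolute)
    return kept

found = on_site_links(
    "http://www.mysite.com/",
    "http://www.mysite.com/docs/index.html",
    ["intro.html", "/about.html", "http://www.example.org/", "../news.html#top"],
)
print(found)
```

Here the link to www.example.org is discarded because, once made absolute, it does not begin with the base URL.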

Modules

The sitemapper.pl distribution includes two modules that the script requires:

WWW::Sitemap
LWP::AuthenAgent

WWW::Sitemap is the module that is used to generate the sitemap structure from which the various output formats are generated. The interface provides access to the list of URLs for a site, and to the links from each of these URLs. It also supports a traverse method, which allows the caller to specify a callback, so that other formats of sitemap can be generated, or other sitemap-related functionality implemented. See the documentation of this module for more details.
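To show the idea of a traversal that takes a callback, here is a minimal Python sketch. It does not reproduce the WWW::Sitemap interface (see that module's own documentation for the real method and its arguments); the `traverse` function, its `(url, depth)` callback signature, and the sample tree are all assumptions made for illustration.

```python
def traverse(tree, url, callback, depth=0):
    """Visit url and then its descendants, calling callback(url, depth) for each."""
    callback(url, depth)
    for child in tree.get(url, []):
        traverse(tree, child, callback, depth + 1)

# Hypothetical sitemap structure: each URL maps to the pages placed beneath it.
tree = {
    "http://www.mysite.com/": [
        "http://www.mysite.com/a.html",
        "http://www.mysite.com/b.html",
    ],
    "http://www.mysite.com/a.html": ["http://www.mysite.com/c.html"],
}

# One callback builds an indented plain-text map; another just collects URLs.
text_lines = []
traverse(tree, "http://www.mysite.com/",
         lambda url, depth: text_lines.append("  " * depth + url))

all_urls = []
traverse(tree, "http://www.mysite.com/", lambda url, depth: all_urls.append(url))

print("\n".join(text_lines))
```

Because the output logic lives entirely in the callback, a new sitemap format needs only a new callback, not a new traversal.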

LWP::AuthenAgent is a simple subclass of the LWP::UserAgent module, which allows requests to be made for URLs that require authentication, by prompting the user to type the username / password information for the relevant realm. This information is stored in the LWP::AuthenAgent object, so that repeated requests to the same realm can be made without re-typing the authentication details (a bit like a web browser, in fact). tty echo is switched off for the password.
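The prompt-once-per-realm behaviour can be sketched in Python as follows. This is not the LWP::AuthenAgent code: the `CredentialCache` class is hypothetical, keying the cache on the host as well as the realm is an assumption, and Python's standard `getpass.getpass` plays the role of reading the password with echo switched off. The demonstration injects scripted answers so that it runs without a terminal.

```python
import getpass

class CredentialCache:
    """Remember one (username, password) pair per (host, realm)."""

    def __init__(self, ask_user=input, ask_password=getpass.getpass):
        # By default the password prompt uses getpass, which does not echo input.
        self._ask_user = ask_user
        self._ask_password = ask_password
        self._store = {}

    def get(self, host, realm):
        key = (host, realm)
        if key not in self._store:
            user = self._ask_user(f"Username for '{realm}' at {host}: ")
            password = self._ask_password("Password: ")
            self._store[key] = (user, password)
        return self._store[key]

# Demonstration with scripted answers instead of real keyboard input.
prompts = []

def fake_prompt(message):
    prompts.append(message)
    return "secret" if message.startswith("Password") else "alice"

cache = CredentialCache(ask_user=fake_prompt, ask_password=fake_prompt)
first = cache.get("www.mysite.com", "Staff Area")
second = cache.get("www.mysite.com", "Staff Area")
print(first == second, len(prompts))   # True 2
```

The second request for the same realm is answered from the cache, so only the two prompts from the first request are issued.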

Installation

Just the basic Makefile.PL stuff; i.e.:

> perl Makefile.PL
> make
> make test
> make install

Usage

To use sitemapper.pl, just type:

./sitemapper.pl -url http://www.mysite.com/

to get output to stdout, or

./sitemapper.pl -url http://www.mysite.com/ -output mysitemap.html

to output to a file. Type

./sitemapper.pl -help

to get full usage instructions, or

./sitemapper.pl -doc

to output the POD documentation.

Examples

example.html contains an example of sitemapper.pl output, for the Canon Research Europe Ltd Perl Pages (http://www.cre.canon.co.uk/perl/); i.e. by running:

./sitemapper.pl -o example.html -url http://www.cre.canon.co.uk/

example.js.html contains an example of a dynamic HTML version of the site map for the CRE site. This is generated using Jef Pearlman's (jef@mit.edu) JavaScript Tree class.

http://developer.netscape.com/docs/examples/dynhtml/tree.html

Many thanks to Jef for allowing this to be distributed with sitemapper.pl! This is generated by running:

./sitemapper.pl -o example.js.html -url http://www.cre.canon.co.uk/ -format js

example.xml contains the output from:

./sitemapper.pl -o example.xml -url http://www.cre.canon.co.uk/ -format xml

The XML format for this file is pretty ad hoc - probably not of interest to anyone apart from me!

Finally, a plain text version can be generated using the -format text option; for example:

./sitemapper.pl -o example.txt -url http://www.cre.canon.co.uk/ -format text

CPAN Modules

sitemapper.pl uses the following CPAN modules, which must be installed before it will work:

WWW::Robot
HTML::Summary
Digest::MD5
Date::Format
Getopt::Long
HTML::Entities
IO::File
LWP::UserAgent
URI::URL
Term::ReadKey

See http://www.perl.com/CPAN/ for details of how to download / install these modules.

Bugs

Please send any bugs / comments / suggestions to Ave.Wrigley@itn.co.uk