URI::Sequin - Extract information from the URLs of Search-Engines


URI-Sequin documentation  | view source Contained in the URI-Sequin distribution.


NAME

URI::Sequin - Extract information from the URLs of Search-Engines

SYNOPSIS



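	use URI::Sequin qw/se_extract key_extract log_extract %log_types/;

	# Pull the referring URL out of a log line using a built-in log type
	$url = &log_extract($line_from_log_file, 'NCSA');

	# Register a custom log type; the parenthesised group captures the URL
	$log_types{'MyLogType'} = '^(.+?) -> .+$';
	$url = &log_extract($line_from_log_file, 'MyLogType');

	# Keywords used in the search, or undef if none were found
	$keyword_string = &key_extract($url);

	# se_extract returns an array reference holding the name and the URL
	($search_engine_name, $search_engine_url) = @{&se_extract($url)};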




DESCRIPTION

This module provides three tools to aid people trying to analyse Search-Engine URLs. It’s meant mainly for those who want to analyse referrer logs and pick out key information about site visitors, such as which Search-Engine and keywords they used to find the site.

The functions and globals provided (and exported by default) from this module are:

log_extract($log_line, 'Type')

This picks out the referring URL from a line of a logfile. The type can be one of the built-in types or a user-created one. For more information, see %log_types below. This subroutine accepts two scalars (the log line and the name of the log type) and returns a scalar.

key_extract($url)

This tries to determine the keywords used in $url. It accepts a scalar and returns a scalar. Should nothing be found, it returns an undefined value.

se_extract($url)

This tries to determine the name of the Search-Engine used and its URL. It accepts a scalar and returns a reference to an array containing firstly the Search-Engine’s name and secondly the Search-Engine’s URL. Should the URL appear not to be from a search query, it returns a reference to an empty array.
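The two failure cases described above (an undefined value from key_extract, a reference to an empty array from se_extract) are easy to overlook when printing results. The following is a minimal sketch of one way to handle them. The helper describe_referrer is hypothetical and not part of URI::Sequin, and the engine name and URL passed to it are made-up sample values, not output taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of URI::Sequin): summarise the results of
# key_extract() and se_extract() for one referring URL.
#   $keywords - a string, or undef when no keywords were found
#   $engine   - an array reference: [name, URL], or [] for a non-search URL
sub describe_referrer {
    my ($keywords, $engine) = @_;

    # An empty array reference means the URL did not look like a search query.
    return 'not a search-engine referral' unless @{$engine};

    my ($name, $engine_url) = @{$engine};
    my $terms = defined $keywords ? $keywords : '(no keywords found)';
    return "$name ($engine_url): $terms";
}

# With the module installed, the inputs would come from:
#   my $keywords = key_extract($url);
#   my $engine   = se_extract($url);
# The values below are invented for illustration only.
print describe_referrer('perl modules', ['Example Search', 'http://search.example.com']), "\n";
print describe_referrer(undef, []), "\n";
```

Checking the array reference before unpacking it avoids assigning undefined values to the name and URL variables when the referrer is not a search engine.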

%log_types

There are five built-in logfile types already in this hash. They are:

* IIS1 - Microsoft IIS 3.0 and 2.0
* IIS2 - Microsoft IIS 4.0 (W3SVC format)
* NCSA - For APACHE, NETSCAPE and any other NCSA format logs
* ORW - O'Reilly WebSite format
* General - A generalised one that will work with most logfiles

It’s easy to add another one. Simply add a key to the hash, with a regex as its value. Parenthesise the part that matches the referring URL, as the script uses $1 to obtain the URL. (See the example in the SYNOPSIS section.)
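To make the capture-group convention concrete, here is a small sketch in plain Perl that does not load the module. It uses a local stand-in for %log_types holding the 'MyLogType' regex from the SYNOPSIS. The helper extract_referrer and the sample log line are hypothetical, and the helper only mirrors the behaviour described above (match the regex, return $1); it is not the module's actual implementation.

```perl
use strict;
use warnings;

# Local stand-in with the same shape as the module's %log_types hash:
# type name => regex whose first capture group is the referring URL.
my %log_types = (
    MyLogType => '^(.+?) -> .+$',
);

# Hypothetical log line in a "referrer -> requested page" layout.
my $line = 'http://search.example.com/find?q=perl+modules -> /index.html';

# Hypothetical helper mirroring the described behaviour:
# look up the regex for the type, match it, and return $1.
sub extract_referrer {
    my ($log_line, $type) = @_;
    my $pattern = $log_types{$type};
    return unless defined $pattern;            # unknown log type
    return $log_line =~ /$pattern/ ? $1 : undef;
}

# Prints the part of the line that precedes " -> ".
print extract_referrer($line, 'MyLogType'), "\n";
```

Because only the first capture group is used, any other grouping in a custom regex is best written as non-capturing, (?:...), so that the URL stays in $1.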

I have only one request for people who use this module. *Please* tell me where and how you've used it, and if you have any thoughts or suggestions on it, tell me!

BUGS

Doesn't like the Amnesi Search Engine. But then, neither do I. Also, the 'General' log type needs to be used with discretion: be sure that none of the URLs contain a literal " character if you use it.

AUTHOR

Peter Sergeant <pete@grou.ch>

COPYRIGHT
