Data::Classifier - A tool for classifying data with regular expressions


Data-Classifier documentation Contained in the Data-Classifier distribution.

Index


Code Index:

NAME

Top

Data::Classifier - A tool for classifying data with regular expressions

SYNOPSIS

Top

    use strict;
    use warnings;

    use Data::Classifier;

    my $yaml = <<EOY;
    ---
    name: Root
    children:
        - name: BMW
          children:
              - name: Diesel
                match:
                      model: "d\$"
              - name: Sports
                match:
                      model: "i\$"
                      seats: 2
              - name: Really Expensive
                match:
                      model: "^M"
    EOY

    my $classifier = Data::Classifier->new(yaml => $yaml);
    my $attributes1 = { model => '325i', seats => 4 };
    my $class1 = $classifier->process($attributes1);
    my $attributes2 = { model => '535d', seats => 4 };
    my $class2 = $classifier->process($attributes2);
    my $attributes3 = { model => 'M3', seats => 2 };
    my $class3 = $classifier->process($attributes3);
    print "$attributes2->{model}: ", $class2->fqn, "\n";
    print "$attributes3->{model}: ", $class3->fqn, "\n";
    #no real sports car has 4 seats
    print "$attributes1->{model}: ", $class1->fqn, "\n";

OVERVIEW

Top

This module provides tools to classify sets of data contained in hashes against a predefined class hierarchy. Testing against a class is performed using regular expressions stored in the class hierarchy. It is also possible to modify the behavior of the system by subclassing and overloading a few methods.

Note that this module may not be particularly usefull on its own. It is designed to be used as a base class for implementing other systems, such as Config::BuildHelper.

USAGE

Top

Using this module involves creating an instance of the classifier object, passing the class hierarchy in via a YAML file, a YAML string, or prebuilt data structure, and any optional arguments:

    $classifier = Data::Classifier->new(file => 'classes.yaml', debug => 1);
    $classifier = Data::Classifier->new(yaml => $yaml_string);
    $classifier = Data::Classifier->new(tree => $hashref);

Class Definition File

The class definition file is a very specific tree format, normally stored in a YAML file. Each node of the tree is a map with the same set of keys, some of which are optional:

name

The textual name of the node being defined.

data (optional)

Extra data to be returned with classification results.

children (optional)

A sequence of nodes that exists under this node.

match (optional)

A map of keys to test against incomming data and regular expressions to apply to that data. For a match to be true, all items in the map must match the data.

Matching Semantics

By default, this class has very specific matching semantics. For a dataset to match a node, everything listed under the match definition must match the specified data. Additionally, a node which contains no match definition will have all of it's children searched but can never be a match itself.

Methods

$result = $classifier->process($attr)

Classify the data contained in the hash reference stored in $attr and return an instance of Data::Classifier::Result. See the documentation for that class for more information.

$classifier->dump

Return a textual representation of the class hierarchy stored in RAM.

More Information

The rest of this module is documented in Data::Classifier::Result, which you use to access the results of classification.

SUBCLASSING

Top

This class can be subclassed to change its behavior. The following methods are available for overloading:

$classifier->return_result($result)

This method is invoked by $classifier->process() when it needs to return a new instance of a result class. Simply return an instance of your class here, such as:

    sub return_result {
            my ($self, $result) = @_;
            return Data::Classifier::Result->new($result);
    }

$classifier->check_match($matchlist, $attributes)

This method is invoked by $classifier->recursive_match() at each node of the tree that contains a match attribute. The entire contents of the match attribute will be passed in as $matchlist and the hashref given to $classifier->process() will be passed in via $attributes. Return true to indicate a match and false to indicate no match.

$classifier->recursive_search($attributes, $node)

This method is invoked by $classifier->process() to recursively search the entire tree. If you need to change the semantics of how the classifier treats matches against nodes with out a match attribute, you would do that here.

IMPROVEMENTS

Top

Here are a few ideas for improvements to this class:

Data::Classifier::SQLTree

A class that stores it's tree in a SQL database, reconstructs it at startup, and passes it in using the tree argument to new.

AUTHORS

Top

This module was created and documented by Tyler Riddle <triddle@gmail.com>.

BUGS

Top

There are no known bugs at this time.

Please report any bugs or feature requests to bug-data-classifier@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data::Classifier. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.


Data-Classifier documentation Contained in the Data-Classifier distribution.

package Data::Classifier;

our $VERSION = '0.01';

use strict;
use warnings;

use Carp qw(croak);
use YAML qw(LoadFile Load);

use Data::Classifier::Result;

#public interface
sub new {
	my ($class, %opts) = @_;
	my $self = {};

	bless($self, $class);

	if (defined($opts{file})) {
		$self->{TREE} = LoadFile($opts{file});
	} elsif (defined($opts{yaml})) {
		$self->{TREE} = Load($opts{yaml});
	} elsif (defined($opts{tree})) {
		$self->{TREE} = $opts{tree};
	} else {
		croak "You must specify one of file, yaml, or tree";
	}

	$self->{DEBUG} = $opts{debug};

	return $self;
}

sub process {
	my ($self, $attributes) = @_;
	my ($result, $result_class);

	$self->debug("starting classification");

	$self->{CLASS_STACK} = [];

	$self->recursive_search($attributes, $self->{TREE});

	$result = $self->{CLASS_STACK};

	$self->{CLASS_STACK} = undef;	

	$self->debug("classification done\n\n");

	return $self->return_result($result);
}

#this method should be overloaded by base classes to change the class
#that is returned by process
sub return_result {
	my ($self, $result) = @_;
	return Data::Classifier::Result->new($result);
}

sub dump {
	my ($self) = @_;

	return Dumper($self->{TREE});
}

#private interface
sub recursive_search {
	my ($self, $attributes, $node) = @_;
	my $name = $node->{name};
	my $matchmap = $node->{match};
	my $children = $node->{children};
	my $class_stack = $self->{CLASS_STACK};
	my $node_match = 0;
	my $recurse_match = 0;
	my $generic_match = 0;

	$self->debug("testing $name");

	push(@$class_stack, $node);
	
	if (! defined($matchmap)) {
		#no rules for this class, so we will be a member of it only if a lower
		#rule applies
		$node_match = 1;
		$generic_match = 1;
	} elsif (ref($matchmap) ne 'HASH') {
		$self->tree_error("match was not a map");
	} else {
		if ($self->check_match($matchmap, $attributes)) {
			$node_match = 1;
		}
	}

	if ($node_match) {
		#check the children for a more specific class
		$self->debug("looking at child classes");

		if (defined($children)) {
			if (ref($children) ne 'ARRAY') {
				$self->tree_error("children must be a sequence");
			}

			foreach my $child (@$children) {
				if ($self->recursive_search($attributes, $child)) {
					$recurse_match = 1;
					last;
				}
			}
		}
	}

	if ($generic_match && ! $recurse_match) {
		#didn't match a lower level class after a generic match, so this is really no match
		pop(@$class_stack);
		return 0;
	} elsif (! $node_match) {
		pop(@$class_stack);
		return 0;
	}

	return 1;
}

#only return true if everything in $matchlist matches the stuff in $attributes
sub check_match {
	my ($self, $matchlist, $attributes) = @_;
	my $match = 0;

	#no idea why this has to be here, but with out it, matches fail
	#for very odd reasons - not sure if it's my bug or a perl bug, but it's
	#very strange
	keys(%$matchlist);

	while(my ($attribute, $regex) = each(%$matchlist)) {
		my $to_test = $attributes->{$attribute};

		if (! defined($to_test)) {
			$self->debug("nothing to test");
			return 0;
		}

		if (! defined($regex)) {
			$self->debug("no regex");
			die "regex";
		}

		if ($to_test !~ m/$regex/) {
			$self->debug("match failure");
			return 0;
		}

		$self->debug("testing data $attribute $regex '$to_test'");	

		$match = 1;
	}

	if ($match) {
		$self->debug("success");

		return 1;
	}

	$self->debug("fell through with no matches");

	return 0;
}

sub tree_error {
	my ($self, $msg) = @_;
	my $class_stack = $self->{CLASS_STACK};
	my @names;

	foreach my $one (@$class_stack) {
		push(@names, $one->{name});
	}

	$self->debug("ERROR: Class tree was not consistent: $msg\n");
	$self->debug("\tClass path: ", join('::', @names), "\n");

	die "can not continue after class tree error";
}

sub debug {
	my ($self, $msg) = @_;
	print STDERR "DEBUG: $msg\n" if $self->{DEBUG};
}

1;

__END__