| Algorithm-SVMLight documentation | view source | Contained in the Algorithm-SVMLight distribution. |
Algorithm::SVMLight - Perl interface to SVMLight Machine-Learning Package
    use Algorithm::SVMLight;
    my $s = Algorithm::SVMLight->new;

    $s->add_instance(
        attributes => {foo => 1, bar => 1, baz => 3},
        label      => 1,
    );

    $s->add_instance(
        attributes => {foo => 2, blurp => 1},
        label      => -1,
    );

    # ... repeat for several more instances, then:
    $s->train;

    # Find results for unseen instances
    my $result = $s->predict(
        attributes => {bar => 3, blurp => 2},
    );
This module implements a Perl interface to Thorsten Joachims' SVMLight package:
SVMLight is an implementation of Vapnik's Support Vector Machine [Vapnik, 1995] for the problem of pattern recognition, for the problem of regression, and for the problem of learning a ranking function. The optimization algorithms used in SVMlight are described in [Joachims, 2002a] and [Joachims, 1999a]. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently.
-- http://svmlight.joachims.org/
Support Vector Machines in general, and SVMLight specifically, represent some of the best-performing Machine Learning approaches in domains such as text categorization, image recognition, bioinformatics string processing, and others.
For efficiency reasons, the underlying SVMLight engine indexes features by integers, not
strings. Since features are commonly thought of by name (e.g. the
words in a document, or mnemonic representations of engineered
features), we provide in Algorithm::SVMLight a simple mechanism for
mapping back and forth between feature names (strings) and feature
indices (integers). If you want to use this mechanism, use the
add_instance() and predict() methods. If not, use the
add_instance_i() (or read_instances()) and predict_i()
methods.
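The idea behind such a name-to-index mapping can be sketched in a few lines of plain Perl. This is only an illustration of the concept, not the module's actual implementation; the helpers `feature_index`, `feature_name`, and `to_indexed` are hypothetical names invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Algorithm::SVMLight: each new feature
# name is assigned the next unused positive integer, and the reverse
# mapping is remembered in an array.
my (%index_of, @name_of);

sub feature_index {
    my ($name) = @_;
    unless (exists $index_of{$name}) {
        push @name_of, $name;
        $index_of{$name} = scalar @name_of;   # indices start at 1
    }
    return $index_of{$name};
}

sub feature_name {
    my ($index) = @_;
    return $name_of[$index - 1];
}

# Translate a name => value hash into an index => value hash.
sub to_indexed {
    my ($attributes) = @_;
    my %indexed;
    $indexed{ feature_index($_) } = $attributes->{$_} for sort keys %$attributes;
    return \%indexed;
}

my $indexed = to_indexed({ foo => 1, bar => 1, baz => 3 });
for my $i (sort { $a <=> $b } keys %$indexed) {
    printf "%d (%s) => %s\n", $i, feature_name($i), $indexed->{$i};
}
```

Names are sorted before indices are assigned only to make the sketch deterministic; any stable assignment would do.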
For installation instructions, please see the README file included with this distribution.
The new() method creates a new Algorithm::SVMLight object and returns it. Any named
arguments that correspond to SVM parameters will cause their
corresponding set_***() method to be invoked:
    $s = Algorithm::SVMLight->new(
        type              => 2,  # Regression model
        biased_hyperplane => 0,  # Nonbiased
        kernel_type       => 3,  # Sigmoid
    );
See the set_***(...) method for a list of such parameters.
The following parameters can be set by using methods with their
corresponding names - for instance, the maxiter parameter can be
set by using set_maxiter($x), where $x is the new desired value.
Learning parameters:
type
svm_c
eps
svm_costratio
transduction_posratio
biased_hyperplane
sharedslack
svm_maxqpsize
svm_newvarsinqp
kernel_cache_size
epsilon_crit
epsilon_shrink
svm_iter_to_shrink
maxiter
remove_inconsistent
skip_final_opt_check
compute_loo
rho
xa_depth
predfile
alphafile
Kernel parameters:
kernel_type
poly_degree
rbf_gamma
coef_lin
coef_const
custom
For an explanation of these parameters, you may be interested in looking at the svm_common.h file in the SVMLight distribution.
It would be a good idea if you only set these parameters via arguments
to new() (see above) or right after calling new(), since I don't
think the underlying C code expects them to change in the middle of a
process.
The add_instance() method adds a training instance to the set of instances which will be used to
train the model. An attributes parameter specifies a hash of
attribute-value pairs for the instance, and a label parameter
specifies the label. The label must be a number, and typically it
should be 1 for positive training instances and -1 for negative
training instances. The keys of the attributes hash should be
strings, and the values should be numbers (the values of each attribute).
All training instances share the same attribute-space; if an attribute is unspecified for a certain instance, it is equivalent to specifying a value of zero. Typically you can save a lot of memory (and potentially training time) by omitting zero-valued attributes.
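As a small illustration of the point about zero-valued attributes, the following sketch strips zeros from an attribute hash before it would be handed to add_instance(). The helper `strip_zeros` is a hypothetical name used only here; it is not provided by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: drop zero-valued attributes, since an omitted
# attribute is equivalent to one with a value of zero.
sub strip_zeros {
    my ($attributes) = @_;
    my %sparse;
    for my $name (keys %$attributes) {
        $sparse{$name} = $attributes->{$name} if $attributes->{$name} != 0;
    }
    return \%sparse;
}

my $dense  = { foo => 2, bar => 0, baz => 0, blurp => 1 };
my $sparse = strip_zeros($dense);

# Only the non-zero attributes remain.
print join(', ', map { "$_ => $sparse->{$_}" } sort keys %$sparse), "\n";
```

The resulting hash could be passed as the attributes parameter in place of the dense one.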
Each training instance may have a "cost factor" assigned to it,
indicating the relative cost of misclassification of the instance.
The default is a cost of 1.0; to assign a different cost, pass a
cost_factor parameter with the desired value.
When using a ranking SVM, you may also pass a query_id parameter,
whose integer value will identify the group of instances in which this
instance belongs for ranking purposes.
Finally, a slack_id parameter may also be passed and it will become
the slackid member of the underlying DOC C struct, used in an
"OPTIMIZATION" SVM (type==4).
The add_instance_i() method is just like add_instance(), but bypasses all the
string-to-integer mapping of feature names. Use this method when you
already have your features represented as integers. The $label
parameter must be a number (typically 1 or -1), and the
@indices and @values arrays must be parallel arrays of indices
and their corresponding values. Furthermore, the indices must be
positive integers and given in strictly increasing order.
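If your data starts out as an index-to-value hash, the parallel arrays can be prepared along these lines. This is a sketch under stated assumptions: `parallel_arrays` is a hypothetical helper, and the exact argument list of add_instance_i() is not shown here, so the sketch stops at building the two arrays.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an index => value hash into two parallel
# arrays, with indices validated as positive integers and sorted so that
# they strictly increase (hash keys are unique, so no duplicates occur).
sub parallel_arrays {
    my ($by_index) = @_;
    for my $i (keys %$by_index) {
        die "Feature index '$i' is not a positive integer\n"
            unless $i =~ /^[1-9][0-9]*$/;
    }
    my @indices = sort { $a <=> $b } keys %$by_index;
    my @values  = map { $by_index->{$_} } @indices;
    return (\@indices, \@values);
}

my ($indices, $values) = parallel_arrays({ 37 => 0.5, 2 => 1, 5 => 3 });
print "indices: @$indices\n";
print "values:  @$values\n";
```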
If you like add_instance_i(), I've got a predict_i() I bet
you'll just love.
An alternative to calling add_instance_i() for each instance is to
organize a collection of training data into SVMLight's standard
"example_file" format, then call this read_instances() method to
import the data. Under the hood, this calls SVMLight's
read_documents() C function. When it's convenient for you to
organize the data in this manner, you may see speed improvements.
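For reference, each line of an SVMLight example file has the form `<label> <index>:<value> <index>:<value> ...`, with indices in increasing order. The sketch below formats instances that way; `example_line` is a hypothetical helper, and the instance data is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format one instance as a line of an example file,
# "<label> <index>:<value> ...", with indices in increasing order.
sub example_line {
    my ($label, $by_index) = @_;
    my @pairs = map { sprintf('%d:%s', $_, $by_index->{$_}) }
                sort { $a <=> $b } keys %$by_index;
    return join(' ', $label, @pairs);
}

# Made-up instances: [ label, { index => value, ... } ]
my @instances = (
    [  1, { 1 => 1, 2 => 1, 3 => 3 } ],
    [ -1, { 1 => 2, 4 => 1 } ],
);

print example_line(@$_), "\n" for @instances;
```

Writing these lines to a file instead of standard output yields data in the layout that read_instances() is described as importing.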
When using a ranking SVM, it is possible to customize the cost of ranking each pair of instances incorrectly by supplying a custom Perl callback function.
For two instances i and j, the custom function will receive four
arguments: the rankvalues of instances i and j, followed by the
costfactors of instances i and j. It should return a real
number indicating the cost.
By default, SVMLight will use an internal C function assigning a cost
of the average of the costfactors for the two instances.
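A callback with the four-argument shape described above might look like the following sketch. The first function simply reproduces the averaging behaviour in Perl; the second is a made-up alternative for illustration. How the callback is registered with the model is not shown, since that call is not described in this section.

```perl
use strict;
use warnings;

# Reproduces the described default: the average of the two cost factors,
# ignoring the rank values.
my $average_cost = sub {
    my ($rank_i, $rank_j, $cost_i, $cost_j) = @_;
    return ($cost_i + $cost_j) / 2;
};

# A hypothetical alternative: scale the average by the gap between the
# two rank values, so pairs that are far apart in rank cost more.
my $gap_weighted_cost = sub {
    my ($rank_i, $rank_j, $cost_i, $cost_j) = @_;
    return abs($rank_i - $rank_j) * ($cost_i + $cost_j) / 2;
};

print $average_cost->(3, 1, 1.0, 2.0), "\n";        # prints 1.5
print $gap_weighted_cost->(3, 1, 1.0, 2.0), "\n";   # prints 3
```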
After a sufficient number of instances have been added to your model,
call train() in order to actually learn the underlying
discriminative Machine Learning model.
Depending on the number of instances (and to a lesser extent the total
number of attributes), this method might take a while. If you want to
train the model only once and save it for later re-use in a different
context, see the write_model() and read_model() methods.
Returns a boolean value indicating whether or not train() has been
called on this model.
After train() has been called, the model may be applied to
previously-unseen combinations of attributes. The predict() method
accepts an attributes parameter just like add_instance(), and
returns its best prediction of the label that would apply to the given
attributes. The sign of the returned label (positive or negative)
indicates whether the new instance is considered a positive or
negative instance, and the magnitude of the label corresponds in some
way to the confidence with which the model is making that assertion.
The predict_i() method is just like predict(), but bypasses all the string-to-integer
mapping of feature names. See also add_instance_i().
The write_model() method saves the given trained model to the file $file. The model may
later be re-loaded using the read_model() method. The model is
written using SVMLight's write_model() C function, so it will be
fully compatible with SVMLight command-line tools like
svm_classify.
The read_model() method reads a model that has previously been written with write_model():
    my $m = Algorithm::SVMLight->new();
    $m->read_model($file);
The model file is read using SVMLight's read_model() C function, so
if you want to, you could initially create the model with one of
SVMLight's command-line tools like svm_learn.
After training a linear model (or reading in a model file), get_linear_weights() will return a reference to an array containing the linear weights of the model. This can be useful for model inspection, to see which features have the greatest impact on decision-making.
    my $arrayref = $m->get_linear_weights();
The first element (position 0) of the array will be the threshold
b, and the rest of the elements will be the weights themselves.
Thus from 1 upward, the indices align with SVMLight's internal
indices.
If the model has not yet been trained, or if the kernel type is not linear, an exception will be thrown.
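As a sketch of the kind of inspection this enables, the following ranks feature indices by the absolute value of their weight. The weights array here is made up for illustration (a real one would come from get_linear_weights()), and `top_features` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: given an array reference laid out as described
# above (threshold at position 0, weights from position 1 upward), return
# the $n feature indices with the largest absolute weights.
sub top_features {
    my ($weights, $n) = @_;
    my @by_impact = sort {
        abs($weights->[$b]) <=> abs($weights->[$a]) || $a <=> $b
    } 1 .. $#$weights;
    $n = @by_impact if $n > @by_impact;
    return @by_impact[0 .. $n - 1];
}

# Made-up data: position 0 is the threshold b, the rest are weights.
my $weights = [ 0.25, 0.1, -2.0, 0.0, 1.5 ];

for my $i (top_features($weights, 2)) {
    printf "feature %d: weight %s\n", $i, $weights->[$i];
}
```

Position 0 is skipped deliberately, because it holds the threshold rather than a feature weight.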
The feature_names() method returns a list of feature names that have been fed to
add_instance() as keys of the attributes parameter, or in a
scalar context the number of such names.
The num_features() method returns the number of features known to this
model. Note that if you use add_instance_i() or read_instances(), some
of the counted features may never actually have been seen: you could add
instances with only indices 2, 5, and 37, never having added any
instances with the indices in between, and num_features() would still
return 37. This is because after training, an instance
could be passed to the predict() method with real values for these
previously unseen features. If you just use add_instance()
instead, you'll probably never run into this issue, and in a scalar
context num_features() will look just like feature_names().
Returns the number of training instances known to the model. It should be fine to call this method either before or after training actually occurs.
Ken Williams, <kwilliams@cpan.org>
The Algorithm::SVMLight perl interface is copyright (C) 2005-2008
Thomson Legal & Regulatory, and written by Ken Williams. It is free
software; you can redistribute it and/or modify it under the same
terms as perl itself.
Thorsten Joachims and/or Cornell University of Ithaca, NY control the
copyright of SVMLight itself - you will find full copyright and
license information in its distribution. You are responsible for
obtaining an appropriate license for SVMLight if you intend to use
Algorithm::SVMLight. In particular, please note that SVMLight "is
granted free of charge for research and education purposes. However
you must obtain a license from the author to use it for commercial
purposes."
To avoid any copyright clashes, the SVMLight.patch file distributed here is granted under the same license terms as SVMLight itself.