| Text-Document documentation | view source | Contained in the Text-Document distribution. |
Text::Bloom - Evaluate Bloom signature of a set of terms
my $b = Text::Bloom->new(); $b->Compute( qw( foo bar baz ) ); my $sig = $b->WriteToString(); $b->WriteToFile( 'afile.sig' ); my $b2 = Text::Bloom::NewFromFile( 'afile.sig' ); my $b3 = Text::Bloom->new(); $b3->Compute( qw( foo bar barbaz ) ); my $sim = $b->Similarity( $b2 ); my $b4 = Text::Bloom::NewFromString( $sig );
Text::Bloom applies the Bloom filtering technique to
the statistical analysis of documents.
The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.
Each quantized value is mapped to d integers in the range
0..size-1, where size is an integer less than p,
currently 2^17, using a family of hash functions,
computed by the HashV function.
Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.
The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.
The signature may be used for:
The bit representation may be written to and read from a file.
Text::Bloom prepends a header to the bit stream proper;
moreover, whenever the package Compress::Zlib is available,
the bit vector is compressed, so that disk space requirements
are drastically reduced, especially for small documents.
The hash function is obviously a crucial component of the filter;
the reference implementation uses a radix representation of
strings. Each term must therefore match the regular
expression /[0-9a-z]+/.
There are quite a few viable alternatives, which can be pursued
by subclassing and redefining the method QuantizeV.
The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package). In the
latter case the methods which are foreseen to be redefined are
those ending with a V suffix. Redefining other methods
will require greater attention.
The constructor. No arguments are required.
$b = Text::Bloom->new();
Take a string written by WriteToString (see below)
and create a new Text::Bloom with the same contents;
call die whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library in unavailable.
my $b = Text::Bloom::NewFromString( $str );
The return value is a blessed reference; put in another way, this is an alternative contructor.
The string should have been written by WriteToString;
you may of course tweak the string contents, but
at this point you're entirely on you own.
Utility function that reads a binary file and performs a NewFromString
on its content; see its counterpart, WriteToFile.
my $b2 = Text::Document::NewFromFile( 'foo.sig' );
Set and get the size of the filter, in bits. The default size is currently 128K.
print 'size is ' . $b->Size() . "\n"; $b->Size( 65536 );
The Size method must be called before the Compute method
in order to have effect.
Compute the Bloom signature from the given set of words and store it internally.
$b->Compute( qw( foo bar baz foobar bazbaz ) );
Makes use of the QuantizeV method.
Convert a term into an integer; must return
an integer in the range 0 .. $Text::Bloom::p-1.
It is called as
my $hash = $b->QuantizeV( $term );
The current version is designed for strings matching
/[a-z0-9]+/. Other characters do not cause errors,
but degrade the hash function performance.
This function is a likely candidate for redefinition.
Convert an integer to a (smaller) integer, according to one of a class of similar functions.
It is internally called as:
my $index = $b->HashV( $order, $value );
The $value must belong to the interval
0..$Text::Bloom::p-1, while the index must
lie in 0..size-1. $order is
a small integer from 0 to d-1.
The default implementation is
index = m[order] * value + q[order] (mod size)
the values of m and q are taken from the array
@Text::Bloom::hashParam; the form of the function
is taken from [2].
Convert the Bloom signature into a string which can be saved and
later restored with NewFromString. Compute must have
been called previously.
my $str = $b->WriteToString();
The string begins with a header which encodes the originating package, its version, the parameters of the current instance.
Whenever possible, Compress::Zlib is used in order to
compress the bit vector in the most efficient way.
On systems without Compress::Zlib, the bit string is
saved uncompressed.
These convenience functions just call their String counterparts and read/write the file specified in the argument.
$b->WriteToFile( 'foo.sig' );
spinellia@acm.org (Andrea Spinelli) walter@humans.net (Walter Vannini)
Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13, 7, July 1970, pages 422-426. (available electronically from ACM Digital Library).
M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel Free-Text Searching", Communications of the ACM, 32, 10, October 1989, pages 1237-1239. (available electronically from ACM Digital Library).
On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence.
2001-11-02 - initial revision
| Text-Document documentation | view source | Contained in the Text-Document distribution. |