Search::Tools::Tokenizer - split a string into meaningful tokens


Search-Tools documentation Contained in the Search-Tools distribution.

Index


Code Index:

NAME

Top

Search::Tools::Tokenizer - split a string into meaningful tokens

SYNOPSIS

Top

 use Search::Tools::Tokenizer;
 my $tokenizer = Search::Tools::Tokenizer->new();
 my $tokens = $tokenizer->tokenize('quick brown red dog');
 while ( my $token = $tokens->next ) {
     # token isa Search::Tools::Token
     print "token = $token\n";
     printf("str: %s, len = %d, u8len = %d, pos = %d, is_match = %d, is_hot = %d\n",
        $token->str,
        $token->len, 
        $token->u8len, 
        $token->pos, 
        $token->is_match, 
        $token->is_hot
     );
 }

DESCRIPTION

Top

A Tokenizer object splits a string into Tokens based on a regex. Tokenizer is used primarily by the Snipper class.

METHODS

Top

Most of Search::Tools::Tokenizer is written in C/XS so if you view the source of this class you will not see much code. Look at the source for Tools.xs and search-tools.c if you are interested in the internals.

This class inherits from Search::Tools::Object. Only new or overridden methods are documented here.

init

Called by new().

re([ regex ])

Get/set the regex used by tokenize() tokenize_pp(). Typically you set this once in new(). The default value is:

 qr/\w+(?:'\w+)*/

which will match words and contractions (e.g., "do", "not" and "don't").

tokenize( string [, heat_seeker, match_num] )

Returns a TokenList object representin the Tokens in string. string is "split" according to the regex in re().

heat_seeker can be either a CODE reference or a regex object (qr//) to use for testing is_hot per token.

match_num is the parentheses number to consider the matching token in the re() value. The default is 0 (the entire matching pattern).

tokenize_pp( string )

Returns a TokenListPP object.

A pure-Perl implementation of tokenize(). Mostly written so you can see what the XS algorithm does, if you are so inclined, and so the author could benchmark the two implementations and thereby feel some satisfaction at having spent the time writing the XS/C version (2-3x faster than Perl).

get_offsets( string, regex )

Returns an array ref of pos() values for start offsets of regex within string

set_debug( n )

Sets the XS debugger on. By default, setting debug(1) (which is inherited from Search::Tools::Object) is not sufficient to trigger the XS debugging. Use set_debug() if you want lots of info on stderr.

AUTHOR

Top

Peter Karman <karman@cpan.org>

BUGS

Top

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

Top

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools




You can also look for information at:

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tools

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Search-Tools

* CPAN Ratings

http://cpanratings.perl.org/d/Search-Tools

* Search CPAN

http://search.cpan.org/dist/Search-Tools/

COPYRIGHT

Top


Search-Tools documentation Contained in the Search-Tools distribution.

package Search::Tools::Tokenizer;
use strict;
use warnings;
use base qw( Search::Tools::Object );
use Search::Tools;    # XS package required
use Search::Tools::Token;
use Search::Tools::TokenList;
use Search::Tools::UTF8;
use Carp;

our $VERSION = '0.59';

__PACKAGE__->mk_accessors(qw( re ));

sub init {
    my $self = shift;
    $self->SUPER::init(@_);
    $self->{re} ||= qr/\w+(?:[\'\-]\w+)*/;
    if ( $self->debug ) {
        $self->set_debug( $self->debug - 1 );    # XS debug a level higher
    }
    return $self;
}

sub tokenize_pp {
    require Search::Tools::TokenPP;
    require Search::Tools::TokenListPP;

    my $self = shift;
    if ( !defined $_[0] ) {
        croak "str required";
    }

    # XS modifies the original arg, so we do too.
    # this is same slight optimization XS does. ~5%
    if ( !is_ascii( $_[0] ) ) {
        $_[0] = to_utf8( $_[0] );
    }
    my $heat_seeker = $_[1];

    # match_num ($_[2]) not supported in PP

    my @heat   = ();
    my @tokens = ();
    my $i      = 0;
    my $re     = $self->{re};
    my $heat_seeker_is_coderef
        = ( defined $heat_seeker and ref($heat_seeker) eq 'CODE' ) ? 1 : 0;

    # TODO is_sentence_* logic
    for ( split( m/($re)/, $_[0] ) ) {
        next unless length($_);
        my $tok = bless(
            {   'pos'    => $i++,
                str      => $_,
                is_hot   => 0,
                is_match => 0,
                len      => byte_length($_),
                u8len    => length($_),
            },
            'Search::Tools::TokenPP'
        );
        if ( $_ =~ m/^$re$/ ) {
            $tok->{is_match} = 1;
            if ($heat_seeker_is_coderef) {
                $heat_seeker->($tok);
            }
            elsif ( defined $heat_seeker ) {
                $tok->{is_hot} = $_ =~ m/$heat_seeker/;
            }
        }
        push( @heat, $tok->{pos} ) if $tok->{is_hot};
        push @tokens, $tok;
    }
    return bless(
        {   tokens => \@tokens,
            num    => $i,
            'pos'  => 0,
            heat   => \@heat,
        },
        'Search::Tools::TokenListPP'
    );
}

1;

__END__