Search::Tools::TokenListUtils - mixin methods for TokenList and TokenListPP


Search-Tools documentation Contained in the Search-Tools distribution.

NAME

Search::Tools::TokenListUtils - mixin methods for TokenList and TokenListPP

SYNOPSIS

 my $tokens = $tokenizer->tokenize( $string );
 if ( $tokens->str eq $string) {
    print "string is same, before and after tokenize()\n";
 }
 else {
    warn "I'm filing a bug report against Search::Tools right away!\n";
 }

 my ($start_pos, $end_pos) = $tokens->get_window( 5, 20 );
 # $start_pos probably == 0
 # $end_pos probably   == 25

 my $slice = $tokens->get_window_tokens( 5, 20 );
 for my $token (@$slice) {
    print "token = $token\n";
 }

DESCRIPTION

Search::Tools::TokenListUtils contains pure-Perl methods inherited by both Search::Tools::TokenList and Search::Tools::TokenListPP.

METHODS

str

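Returns a serialized version of the TokenList. If you haven't altered the TokenList since you got it from tokenize(), str() returns a string equal in value to the string you passed to tokenize(), though it is a new scalar and not the same variable. If an optional joiner string is passed, it is inserted between tokens; the default is the empty string.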

Both Search::Tools::TokenList and TokenListPP are overloaded to stringify to the str() value.
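
The stringification described above can be sketched with Perl's overload pragma. This is a minimal illustration, not the real class: My::TokenList, its constructor, and the plain-string tokens are hypothetical stand-ins, and only the body of str() follows the module source.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, NOT the real Search::Tools::TokenList.
package My::TokenList;

# Stringify the object by calling str() with no joiner.
use overload
    '""'     => sub { $_[0]->str },
    fallback => 1;

sub new {
    my ( $class, @tokens ) = @_;
    return bless { tokens => [@tokens] }, $class;
}

sub as_array { return $_[0]->{tokens} }

# Join the tokens, optionally with a joiner string, as str() does.
sub str {
    my ( $self, $joiner ) = @_;
    $joiner = '' unless defined $joiner;
    return join( $joiner, map {"$_"} @{ $self->as_array } );
}

package main;

my $string = "hello, world";
my $list   = My::TokenList->new( 'hello', ', ', 'world' );

print "string is same, before and after\n" if "$list" eq $string;
print $list->str('|'), "\n";
```

Because every token is kept, including whitespace and punctuation, joining with the empty string reproduces the original input.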

get_window( pos [, size, as_sentence] )

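Returns a list of two values: the start and end positions of a window extending size tokens on either side of pos, much like the bounds of a slice of the TokenList. size defaults to 20. Croaks if pos is undefined or is not a valid index in the TokenList.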

Note that size is the number of tokens, not matches. So if you want a given number of "words" on each side, pass roughly twice that number as size.

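Note too that size counts the tokens on one side of pos only, so the entire window width (the length of the returned slice) is about size*2, plus or minus 1, and shorter when pos is near either end of the TokenList. Unless as_sentence is true, the window is trimmed so that it starts and ends on a token for which is_match() is true.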

If as_sentence is true, the window is instead shifted to try to start at the first token prior to pos that returns true for is_sentence_start().
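
The window arithmetic for the default case (as_sentence false) can be sketched without the module. This is an illustration under stated assumptions: window_bounds() is a hypothetical helper, tokens are modelled as plain hashrefs with an is_match flag, and, unlike the module code, the trimming loops stop if the two edges meet.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tools. $tokens is an array ref
# of hashrefs, each with an is_match flag (true for "word" tokens).
sub window_bounds {
    my ( $tokens, $pos, $size ) = @_;
    $size ||= 20;
    my $max_index = $#{$tokens};
    die "illegal pos value" if $pos < 0 or $pos > $max_index;

    # Take $size tokens on either side of $pos, clamped to the list ends.
    my $start = $pos - $size;
    $start = 0 if $start < 0;
    my $end = $pos + $size;
    $end = $max_index if $end > $max_index;

    # Trim inward until both edges are matches.
    $start++ while $start < $end and !$tokens->[$start]{is_match};
    $end--   while $end > $start and !$tokens->[$end]{is_match};

    return ( $start, $end );
}

# Eleven tokens alternating match / non-match, as in "a b c d e f".
my @tokens = map { +{ is_match => ( $_ % 2 == 0 ? 1 : 0 ) } } 0 .. 10;

for my $size ( 3, 2 ) {
    my ( $start, $end ) = window_bounds( \@tokens, 5, $size );
    my $width = $end - $start + 1;
    print "size $size: $start .. $end (width $width)\n";
}
```

With size 3 both edges already fall on matches, giving positions 2 to 8 and a width of size*2 + 1. With size 2 the edges fall on non-matches and are trimmed to 4 and 6, a width of size*2 - 1.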

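get_window_tokens( pos [, size, as_sentence] )

Like get_window(), but returns an array reference containing the Token objects in the window rather than its start and end positions. All arguments are passed through to get_window().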

AUTHOR

Peter Karman <karman@cpan.org>

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools

You can also look for information at:

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tools

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Search-Tools

* CPAN Ratings

http://cpanratings.perl.org/d/Search-Tools

* Search CPAN

http://search.cpan.org/dist/Search-Tools/

COPYRIGHT

package Search::Tools::TokenListUtils;
use strict;
use warnings;
use Carp;

our $VERSION = '0.59';

sub str {
    my $self   = shift;
    my $joiner = shift(@_);
    if ( !defined $joiner ) {
        $joiner = '';
    }
    return join( $joiner, map {"$_"} @{ $self->as_array } );
}

sub get_window {
    my $self = shift;
    my $pos  = shift;
    if ( !defined $pos ) {
        croak "pos required";
    }

    my $size        = int( shift || 0 ) || 20;    # avoid int(undef) warning
    my $as_sentence = shift      || 0;
    my $max_index   = $self->len - 1;

    if ( $pos > $max_index or $pos < 0 ) {
        croak "illegal pos value: no such index in TokenList";
    }

    #warn "window size $size for pos $pos";

    # get the $size tokens on either side of $pos
    my ( $start, $end );

    # is there room for $size tokens before $pos?
    if ( $pos > $size ) {
        $start = $pos - $size;
    }

    # is there room for $size tokens after $pos?
    if ( $pos < ( $max_index - $size ) ) {
        $end = $pos + $size;
    }

    # otherwise clamp to the ends of the TokenList
    $start ||= 0;
    $end   ||= $max_index;

    if ($as_sentence) {
        my $sentence_starts = $self->get_sentence_starts;

        # default to what we have.
        my $start_for_pos = $start;
        my $i             = 0;

        #warn "looking for sentence_start for start = $start end = $end\n";
        for (@$sentence_starts) {

            #warn " $_ [$i]\n";
            if ( $_ >= $pos ) {
                $start_for_pos = $sentence_starts->[$i];
                last;
            }
            $i++;
        }

        #warn "found $start_for_pos (start = $start end = $end)\n";
        if ( $start_for_pos != $start ) {
            if ( $start_for_pos < $start ) {
                $end -= ( $start - $start_for_pos );
            }
            else {
                $end += ( $start_for_pos - $start );
            }
            $start = $start_for_pos;
        }

        #warn "now $start_for_pos (start = $start end = $end)\n";
    }
    else {

        # make sure window starts and ends with is_match
        while ( !$self->get_token($start)->is_match ) {
            $start++;
        }
        while ( !$self->get_token($end)->is_match ) {
            $end--;
        }
    }

    #warn "return $start .. $end";
    #warn "$size ~~ " . ( $end - $start );

    return ( $start, $end );
}

sub get_window_tokens {
    my $self = shift;
    my ( $start, $end ) = $self->get_window(@_);
    my @slice = ();
    for ( $start .. $end ) {
        push( @slice, $self->get_token($_) );
    }
    return \@slice;
}

1;

__END__