Lingua::CJK::Tokenizer - CJK Tokenizer


Lingua-CJK-Tokenizer documentation Contained in the Lingua-CJK-Tokenizer distribution.

NAME

Lingua::CJK::Tokenizer - CJK Tokenizer

SYNOPSIS

    use Lingua::CJK::Tokenizer;

    my $tknzr = Lingua::CJK::Tokenizer->new();
    $tknzr->ngram_size(5);          # size of the returned n-grams
    $tknzr->max_token_count(100);   # cap on the number of returned n-grams

    my $tokens_ref = $tknzr->tokenize("CJK Text");   # n-grams
    $tokens_ref    = $tknzr->segment("CJK Text");    # chunks
    $tokens_ref    = $tknzr->split("CJK Text");      # uni-grams

    my $flag = $tknzr->has_cjk("CJK Text");
    $flag    = $tknzr->has_cjk_only("CJK Text");

DESCRIPTION

This module tokenizes CJK (Chinese, Japanese, and Korean) text into n-grams.
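
To illustrate what a character n-gram is, here is a minimal pure-Perl sketch. It is not the module's implementation, which lives in XS and may treat non-CJK characters, whitespace, and token limits differently. The helper name `char_ngrams` and its arguments are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal CJK characters

# Hypothetical helper, NOT part of Lingua::CJK::Tokenizer.
# Returns a reference to the list of overlapping $n-character runs in
# $text, stopping early once $max runs have been collected (if given).
sub char_ngrams {
    my ($text, $n, $max) = @_;
    my @chars = split //, $text;
    my @grams;
    for my $i (0 .. @chars - $n) {
        last if defined $max && @grams >= $max;
        push @grams, join '', @chars[$i .. $i + $n - 1];
    }
    return \@grams;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', @{ char_ngrams('東京都庁', 2) }), "\n";   # prints: 東京 京都 都庁
```

With `$n` set to 1 the same helper yields single characters (uni-grams), and the optional third argument plays a role loosely comparable to a token-count limit.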

METHODS

ngram_size

Sets the size of the returned n-grams.

max_token_count

Sets a limit on the number of returned n-grams, for input text that is very long or of indefinite size.

tokenize

Tokenizes text into n-grams.

segment

Cuts CJK text into chunks.

split

Tokenizes text into uni-grams.

has_cjk

Returns true if the text contains CJK characters.

has_cjk_only

Returns true if the text contains only CJK characters.
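
The difference between the two predicates can be sketched in pure Perl with Unicode script properties. This is only an approximation under a stated assumption: "CJK" is taken here to mean the Han, Hiragana, Katakana, and Hangul scripts, which may not match the character ranges the module actually checks. The names `looks_cjk` and `looks_cjk_only` are hypothetical.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal CJK characters

# Assumption: a "CJK character" belongs to one of these four scripts.
my $CJK = qr/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;

# Hypothetical stand-ins, NOT the module's has_cjk / has_cjk_only.
sub looks_cjk      { return $_[0] =~ $CJK        ? 1 : 0 }   # at least one CJK character
sub looks_cjk_only { return $_[0] =~ /\A$CJK+\z/ ? 1 : 0 }   # every character is CJK

for my $text ('Hello', 'Hello 世界', '世界') {
    printf "%d %d\n", looks_cjk($text), looks_cjk_only($text);
}
```

Under that assumption, mixed text such as 'Hello 世界' passes the first check but fails the second, because the Latin letters and the space are not CJK characters.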

PREREQUISITE

This module requires libunicode by Tom Tromey.

COPYRIGHT

package Lingua::CJK::Tokenizer;

use strict;
use XSLoader;

# Load the compiled XS code that implements this package's methods.
XSLoader::load 'Lingua::CJK::Tokenizer';

1;
__END__