| Lingua-CJK-Tokenizer documentation | Contained in the Lingua-CJK-Tokenizer distribution. |
Lingua::CJK::Tokenizer - CJK Tokenizer
use Lingua::CJK::Tokenizer;

my $tknzr = Lingua::CJK::Tokenizer->new();
$tknzr->ngram_size(5);
$tknzr->max_token_count(100);
my $tokens_ref = $tknzr->tokenize("CJK Text");
$tokens_ref = $tknzr->segment("CJK Text");
$tokens_ref = $tknzr->split("CJK Text");
my $flag = $tknzr->has_cjk("CJK Text");
$flag = $tknzr->has_cjk_only("CJK Text");
This module tokenizes CJK (Chinese, Japanese, Korean) text into n-grams.
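As a rough illustration of what n-gram tokenization means, the following pure-Perl sketch produces overlapping character n-grams with an optional cap on how many are returned. The helper name char_ngrams and its parameters are hypothetical and are not part of this module. The module itself is implemented in XS, and its handling of non-CJK runs and punctuation is not covered here.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, not part of Lingua::CJK::Tokenizer.
# Returns a reference to an array of overlapping character n-grams
# of $text, each $n characters long, stopping after $max n-grams
# when $max is given.
sub char_ngrams {
    my ($text, $n, $max) = @_;
    my @chars = split //, $text;
    my @grams;
    for my $i (0 .. @chars - $n) {
        last if defined $max && @grams >= $max;
        push @grams, join '', @chars[$i .. $i + $n - 1];
    }
    return \@grams;
}

my $grams  = char_ngrams("東京都庁", 2);     # 東京, 京都, 都庁
my $capped = char_ngrams("東京都庁", 2, 2);  # 東京, 京都
```

The size and cap arguments play the roles that ngram_size and max_token_count play in the synopsis. Input shorter than the requested size yields an empty list in this sketch.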
ngram_size - sets the size of the returned n-grams.
max_token_count - sets a limit on the number of returned n-grams, for input text that is very long or of indefinite size.
tokenize - tokenizes text into n-grams.
segment - cuts CJK text into chunks.
split - tokenizes text into unigrams.
has_cjk - returns true if the text contains CJK characters.
has_cjk_only - returns true if the text contains only CJK characters.
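The difference between the two predicates in the synopsis, has_cjk and has_cjk_only, can be sketched in pure Perl with Unicode script properties. This is only an approximation under a stated assumption: "CJK" is taken to mean the Han, Hiragana, Katakana, and Hangul scripts, which may not match the module's own definition. The function names contains_cjk and is_cjk_only are hypothetical stand-ins.

```perl
use strict;
use warnings;
use utf8;

# Assumption: "CJK" means the Han, Hiragana, Katakana, and Hangul scripts.
# The module's own character classification may differ.
my $cjk = qr/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;

# Hypothetical stand-in for has_cjk: true if any character is CJK.
sub contains_cjk {
    my ($text) = @_;
    return $text =~ /$cjk/ ? 1 : 0;
}

# Hypothetical stand-in for has_cjk_only: true if the text is non-empty
# and every character is CJK.
sub is_cjk_only {
    my ($text) = @_;
    return $text =~ /\A$cjk+\z/ ? 1 : 0;
}
```

With these definitions, mixed text such as "Tokyo 東京" passes the first check but fails the second, because the Latin letters and the space are not CJK characters.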
This module requires libunicode by Tom Tromey.
Copyright (c) 2009 Yung-chung Lin.
This program is free software; you can redistribute it and/or modify it under the MIT License.
package Lingua::CJK::Tokenizer;
use strict;
use XSLoader;
XSLoader::load 'Lingua::CJK::Tokenizer';
1;
__END__