Lingua::JA::Summarize - A keyword extractor / summary generator


Contained in the Lingua-JA-Summarize distribution.

NAME

Lingua::JA::Summarize - A keyword extractor / summary generator

SYNOPSIS

    # Functional style

    use Lingua::JA::Summarize qw(:all);

    @keywords = keyword_summary('You need longer text to get keywords', {
        minwords => 3,
        maxwords => 5,
    });
    print join(' ', @keywords) . "\n";

    @keywords = file_keyword_summary('filename_to_analyze.txt', {
        minwords => 3,
        maxwords => 5,
    });
    print join(' ', @keywords) . "\n";

    # OO style

    use Lingua::JA::Summarize;

    $s = Lingua::JA::Summarize->new;

    $s->analyze('You need longer text to obtain keywords');
    $s->analyze_file('filename_to_analyze.txt');

    @keywords = $s->keywords({ minwords => 3, maxwords => 5 });
    print join(' ', @keywords) . "\n";




DESCRIPTION

Lingua::JA::Summarize is a keyword extractor / summary generator for Japanese texts. It uses MeCab, a Japanese morphological analyzer, to split the text into words and extract keywords from them.

CONSTRUCTOR

new()
new({ params })

Behaviour parameters may be passed as a hashref. For example:

    new({ mecab => '/usr/local/mecab/bin/mecab' })

ANALYZING TEXT

analyze($string)
analyze_file($filename)

Use either method to analyze text. Both methods throw an error on failure.
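
Because the analysis methods throw on failure, callers usually want to trap the error. The following is a minimal sketch using core Perl's eval; the helper name safe_file_keywords is hypothetical and not part of the module, and the module is loaded lazily with require so that a missing installation is reported the same way as a failed analysis.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): returns the keywords for a
# file, or an empty list if loading the module or analyzing the file fails.
sub safe_file_keywords {
    my ($file) = @_;
    my @keywords;
    my $ok = eval {
        require Lingua::JA::Summarize;
        my $s = Lingua::JA::Summarize->new;
        $s->analyze_file($file);                      # throws on failure
        @keywords = $s->keywords({ maxwords => 5 });
        1;                                            # signal success
    };
    unless ($ok) {
        warn "keyword extraction failed: $@";
        return;
    }
    return @keywords;
}

my @found = safe_file_keywords('filename_to_analyze.txt');
print join(' ', @found), "\n";
```

Ending the eval block with a true value and testing its result avoids relying on $@ alone to detect failure.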

OBTAINING KEYWORDS

keywords()
keywords({ params })

Returns an array of keywords. The following parameters control the output.

maxwords

Maximum number of keywords to be returned. The default is 5.

minwords

Minimum number of keywords to be returned. The default is 0.

threshold

Minimum significance value a word must reach to be treated as a keyword. The maxwords and minwords properties take precedence over this property.
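
One way to picture how the three parameters interact is the sketch below. It is an illustration only, not the module's implementation: select_keywords, the word scores, and the threshold default of 0 are all assumptions made for the example. Words are taken in order of significance; maxwords caps the result, and the threshold only starts to apply once minwords has been satisfied.

```perl
use strict;
use warnings;

# Hypothetical helper: picks keywords from a hashref of word => significance.
sub select_keywords {
    my ($scores, $opts) = @_;
    my $max = defined $opts->{maxwords}  ? $opts->{maxwords}  : 5;
    my $min = defined $opts->{minwords}  ? $opts->{minwords}  : 0;
    my $thr = defined $opts->{threshold} ? $opts->{threshold} : 0;  # assumed default

    # Highest significance first; ties broken alphabetically for stable output.
    my @ranked = sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b }
                 keys %$scores;

    my @picked;
    for my $word (@ranked) {
        last if @picked >= $max;                              # maxwords caps the result
        last if @picked >= $min && $scores->{$word} < $thr;   # minimum met, below threshold
        push @picked, $word;
    }
    return @picked;
}

my %scores = (mecab => 9.0, perl => 6.5, keyword => 4.0, text => 1.2);
my @kw = select_keywords(\%scores, { minwords => 3, maxwords => 5, threshold => 5 });
print join(' ', @kw), "\n";   # prints "mecab perl keyword"
```

Here "keyword" scores below the threshold but is still returned because minwords is 3, while "text" is dropped once the minimum has been met.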

CONTROLLING THE BEHAVIOUR

Use the member functions described below to control the behaviour of the analyzer.

alnum_as_word([boolean])

Sets or retrieves a flag indicating whether a word consisting of letters and digits is kept as a single word rather than split. The flag also controls splitting at apostrophes.

If set to false, "O'Reilly" would be treated as "o reilly", and "30boxes" as "30 boxes".

The default is true.
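
The effect of turning the flag off can be imitated with a couple of regular expressions. This is only a rough illustration of the outcome described above, not how the module or MeCab actually tokenizes; split_alnum is a hypothetical helper, and the lowercasing is inferred from the "o reilly" example.

```perl
use strict;
use warnings;

# Hypothetical helper: mimics the splitting described for alnum_as_word = false.
sub split_alnum {
    my ($word) = @_;
    my $w = lc $word;
    $w =~ s/'/ /g;                                          # split at apostrophes
    $w =~ s/(?<=\d)(?=[a-z])|(?<=[a-z])(?=\d)/ /g;          # split at digit/letter boundaries
    return $w;
}

print split_alnum("O'Reilly"), "\n";   # prints "o reilly"
print split_alnum("30boxes"), "\n";    # prints "30 boxes"
```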

default_cost([number])

Sets or retrieves the default cost applied to unknown words. The default is 1.0.

jaascii_as_word([boolean])

Sets or retrieves a flag indicating whether or not to consider consecutive ascii word and Japanese word as a single word. The default is true.

If set to true, strings like "ǧ¾Úapi" and "lamda´Ø¿ô" are treated as single words.

mecab([mecab_path])

Sets or retrieves the path to the mecab executable. The default is "mecab".

ng([ng_words])

Sets or retrieves a hashref listing the words to omit. The default hashref is generated by the Lingua::JA::Summarize::NG function.

omit_number([boolean])

Sets or retrieves a flag indicating whether or not to omit numbers.

singlechar_factor([number])

Sets or retrieves the factor used when calculating the weight of single-character words. The default is 0.5.

stats()

Returns a list of statistics.

url_as_word([boolean])

Sets or retrieves a flag indicating whether or not to treat URLs as single words.

wordcount()

Returns the number of words analyzed.

CONTROLLING THE BEHAVIOUR GLOBALLY

The default properties can be modified by setting entries in %Lingua::JA::Summarize::LJS_Defaults, or by setting environment variables named after the property, uppercased and prefixed with LJS_.

For example, to set the mecab_charset property:

1) Setting through Perl

    use Lingua::JA::Summarize qw(:all);

    $LJS_Defaults{mecab_charset} = 'sjis' unless defined $LJS_Defaults{mecab_charset};

2) Setting through an environment variable

    % LJS_MECAB_CHARSET=sjis perl -Ilib t/02-keyword.t
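
The naming rule for the environment variables can be sketched in a few lines of core Perl. The helpers env_name_for and defaults_from_env are hypothetical and only demonstrate the mapping from property names to variable names; how the module itself reads the environment is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: property name -> environment variable name.
sub env_name_for {
    my ($prop) = @_;
    return 'LJS_' . uc $prop;
}

# Hypothetical helper: collects defaults for the given properties from an
# environment hashref (pass \%ENV to read the real environment).
sub defaults_from_env {
    my ($env, @props) = @_;
    my %defaults;
    for my $prop (@props) {
        my $value = $env->{ env_name_for($prop) };
        $defaults{$prop} = $value if defined $value;
    }
    return %defaults;
}

print env_name_for('mecab_charset'), "\n";   # prints "LJS_MECAB_CHARSET"

my %d = defaults_from_env({ LJS_MECAB_CHARSET => 'sjis' }, qw(mecab_charset mecab));
print "$d{mecab_charset}\n";                 # prints "sjis"
```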

STATIC FUNCTIONS

keyword_summary($text)
keyword_summary($text, { params })
file_keyword_summary($file)
file_keyword_summary($file, { params })

Given a text or a filename to analyze, returns an array of keywords. The parameters may include any of the properties described in the CONTROLLING THE BEHAVIOUR section as well as any of the parameters accepted by the keywords member function.

NG()

Returns the default hashref of NG words (words omitted from the results).

AUTHOR

Kazuho Oku <kazuhooku ___at___ gmail.com>

ACKNOWLEDGEMENTS

Thanks to Takesako-san for writing the prototype.

COPYRIGHT
