Lingua::ZH::Summary - Extract summary from Chinese text


Lingua-ZH-Summary documentation Contained in the Lingua-ZH-Summary distribution.

Index


Code Index:

NAME

Top

Lingua::ZH::Summary - Extract summary from Chinese text

VERSION

Top

Version 0.03

SYNOPSIS

Top

Given a Chinese text, it will return the summary. Unlike Lingua-ZH-Summarize, this module depends on term frequency instead of knowledge. If knowledge analysis is required, use Lingua::ZH::Summarize instead.

Perhaps a little code snippet.

    use Lingua::ZH::Summary;

    my $foo = Lingua::ZH::Summary->new();
    my $summary = $foo->summary(<FH> or $text);
    ...

METHODS

Top

new

FUNCTIONS

Top

summary

SEE ALSO

Top

Lingua::ZH::Toke, Lingua::ZH::Wrap, Lingua::EN::Summary

AUTHOR

Top

Cheng-Lung Sung, <clsung at cpan.org>

BUGS

Top

Please report any bugs or feature requests to bug-lingua-zh-summary at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-ZH-Summary. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

Top

You can find documentation for this module with the perldoc command.

    perldoc Lingua::ZH::Summary

You can also look for information at:

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Lingua-ZH-Summary

* CPAN Ratings

http://cpanratings.perl.org/d/Lingua-ZH-Summary

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-ZH-Summary

* Search CPAN

http://search.cpan.org/dist/Lingua-ZH-Summary

ACKNOWLEDGEMENTS

Top

Reference to the Lingua::ZH::Summarize module from Audrey Tang <cpan@audreyt.org>.

COPYRIGHT & LICENSE

Top


Lingua-ZH-Summary documentation Contained in the Lingua-ZH-Summary distribution.
package Lingua::ZH::Summary;

use warnings;
use strict;
use utf8;

use Net::YASA;

our $VERSION = '0.03';

my %punct = map { $_ => $_ } qw(。 ? ! ; …);
$punct{$_} = '。' for qw(, -);

my %key  = map { $_ => 1 } qw(是 會 曾 將 能 有);
my %stop = (
    %key, map { $_ => 1 } qw(
	的 裡 和 與 及 年 月 日 時 分 秒 可 對 於 但 也 且 或 中 而 為 叫
    )
);

sub new {
    my $class = shift;
    my $self = {
	yasa => undef
    };
    if(@_) {
	my %arg = @_;

	foreach (keys %arg) {
	    $self->{lc($_)} = $arg{$_};
	}
    }
    $self->{yasa} = Net::YASA->new (minlength => 2);
    bless($self, $class);
    return($self);
}

sub summary {
    my $self = shift;
    my ($text, %options) = @_;

    return unless $text;

    # Strip whitespace and formatting out of the text.
    $text =~ s/^\s+//;
    $text =~ s/\s+/ /sg;
    $text =~ s/\s+$//;

    unless (exists $options{maxlength} and $options{maxlength} > 0) {
	$options{maxlength} = log(length($text)) * 15;
    }

    my $re = "(".join ("|",keys %punct).")";
    my @textlines = split /$re/,$text;

    # First we get the meaningful terms with respect to their term frequency

    my $summary = '';
    my $flag = 1;

    my $termset = $self->{yasa}->extract($text);

    # Get top 5
    my @terms = map {s/\t.*//; $_;} (@{$termset}[0..4]);
    $re = "(?:".join ("|",@terms).")";

    my $i=0;
    my $chunk;
    while (($chunk, my $punct) = (shift @textlines, shift @textlines) and defined($chunk)) {
	($flag = $punct{$punct}, next) unless $flag;
	$flag = $punct{$punct} or next;

	next unless length($chunk) > 5;
	next unless $chunk =~ /.+(?:\Q是\E|\Q會\E|\Q曾\E|\Q將\E|\Q能\E|\Q有\E|\Q為\E)/;
	next unless $chunk =~ /$re/;
	next if $stop{substr($chunk, 0, 1)} or $stop{substr($chunk, -1)};

	$summary .= $chunk . $punct{$punct};

	last if length($summary) >= $options{maxlength};
    }

    ### Done! Do any necessary postprocessing before returning.

    return $summary;
}

1; # End of Lingua::ZH::Summary