| Lingua-ZH-Summary documentation | Contained in the Lingua-ZH-Summary distribution. |
Lingua::ZH::Summary - Extract summary from Chinese text
Version 0.03
Given a Chinese text, it will return the summary. Unlike Lingua-ZH-Summarize, this module depends on term frequency instead of knowledge. If knowledge analysis is required, use Lingua::ZH::Summarize instead.
Perhaps a little code snippet.
use Lingua::ZH::Summary;
my $foo = Lingua::ZH::Summary->new();
my $summary = $foo->summary(<FH> or $text);
...
Lingua::ZH::Toke, Lingua::ZH::Wrap, Lingua::EN::Summary
Cheng-Lung Sung, <clsung at cpan.org>
Please report any bugs or feature requests to
bug-lingua-zh-summary at rt.cpan.org, or through the web interface at
http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-ZH-Summary.
I will be notified, and then you'll automatically be notified of progress on
your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Lingua::ZH::Summary
You can also look for information at:
Reference to the Lingua::ZH::Summarize module from Audrey Tang <cpan@audreyt.org>.
Copyright 2007-2009 Cheng-Lung Sung, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
| Lingua-ZH-Summary documentation | Contained in the Lingua-ZH-Summary distribution. |
package Lingua::ZH::Summary; use warnings; use strict; use utf8; use Net::YASA;
our $VERSION = '0.03';
my %punct = map { $_ => $_ } qw(ã ï¼ ï¼ ï¼ â¦); $punct{$_} = 'ã' for qw(ï¼ ï¼); my %key = map { $_ => 1 } qw(æ¯ æ æ¾ å° è½ æ); my %stop = ( %key, map { $_ => 1 } qw( ç 裡 å è å å¹´ æ æ¥ æ å ç§ å¯ å° æ¼ ä½ ä¹ ä¸ æ ä¸ è çº å« ) );
sub new { my $class = shift; my $self = { yasa => undef }; if(@_) { my %arg = @_; foreach (keys %arg) { $self->{lc($_)} = $arg{$_}; } } $self->{yasa} = Net::YASA->new (minlength => 2); bless($self, $class); return($self); }
sub summary { my $self = shift; my ($text, %options) = @_; return unless $text; # Strip whitespace and formatting out of the text. $text =~ s/^\s+//; $text =~ s/\s+/ /sg; $text =~ s/\s+$//; unless (exists $options{maxlength} and $options{maxlength} > 0) { $options{maxlength} = log(length($text)) * 15; } my $re = "(".join ("|",keys %punct).")"; my @textlines = split /$re/,$text; # First we get the meaningful terms with respect to their term frequency my $summary = ''; my $flag = 1; my $termset = $self->{yasa}->extract($text); # Get top 5 my @terms = map {s/\t.*//; $_;} (@{$termset}[0..4]); $re = "(?:".join ("|",@terms).")"; my $i=0; my $chunk; while (($chunk, my $punct) = (shift @textlines, shift @textlines) and defined($chunk)) { ($flag = $punct{$punct}, next) unless $flag; $flag = $punct{$punct} or next; next unless length($chunk) > 5; next unless $chunk =~ /.+(?:\Qæ¯\E|\Qæ\E|\Qæ¾\E|\Qå°\E|\Qè½\E|\Qæ\E|\Qçº\E)/; next unless $chunk =~ /$re/; next if $stop{substr($chunk, 0, 1)} or $stop{substr($chunk, -1)}; $summary .= $chunk . $punct{$punct}; last if length($summary) >= $options{maxlength}; } ### Done! Do any necessary postprocessing before returning. return $summary; }
1; # End of Lingua::ZH::Summary