Lingua::ZH::Summarize - Summarizing bodies of Chinese text


Lingua-ZH-Summarize documentation Contained in the Lingua-ZH-Summarize distribution.

Index


Code Index:

NAME

Top

Lingua::ZH::Summarize - Summarizing bodies of Chinese text

SYNOPSIS

Top

    use Lingua::ZH::Summarize;

    print summarize( $text );                    # Easy, no? :-)
    print summarize( $text, maxlength => 500 );  # 500-byte summary
    print summarize( $text, wrap => 75 );        # Wrap output to 75 col.

DESCRIPTION

Top

This is a simple module which makes an unscientific effort at summarizing Chinese text. It recognizes simple patterns which look like statements, abridges them, and concatenates them into something vaguely resembling a summary. It needs more work on large bodies of text, but it seems to have a decent effect on small inputs at the moment.

Lingua::ZH::Summarize exports one function, summarize(), which takes the text to summarize as its first argument, and any number of optional directives in name => value form. The options it'll take are:

maxlength

Specifies the maximum length, in bytes, of the generated summary.

wrap

Prettyprints the summary output by wrapping it to the number of columns which you specify. This requires the Lingua::ZH::Wrap module.

Needless to say, this is a very simple and not terribly universally effective scheme, but it's good enough for a first draft, and I'll bang on it more later. Like I said, it's not a scientific approach to the problem, but it's better than nothing.

SEE ALSO

Top

Lingua::ZH::Toke, Lingua::ZH::Wrap, Lingua::EN::Summarize

ACKNOWLEDGEMENTS

Top

Algorithm adapted from the Lingua::EN::Summarize module by Dennis Taylor, <dennis@funkplanet.com>.

AUTHORS

Top

Autrijus Tang <autrijus@autrijus.org>

COPYRIGHT

Top


Lingua-ZH-Summarize documentation Contained in the Lingua-ZH-Summarize distribution.
# $File: //member/autrijus/Lingua-ZH-Summarize/Summarize.pm $ $Author: autrijus $
# $Revision: #2 $ $Change: 3687 $ $DateTime: 2003/01/20 07:22:40 $

package Lingua::ZH::Summarize;
$Lingua::ZH::Summarize::VERSION = '0.01';

use strict;
use vars qw($VERSION @ISA @EXPORT);

use Exporter;
use Lingua::ZH::Toke;

@ISA     = qw(Exporter);
@EXPORT  = qw(summarize);

my %punct = map { $_ => $_ } qw(¡C ¡H ¡I ¡F ...);
$punct{$_} = '¡C' for qw(¡A ¡Ð);

my %key  = map { $_ => 1 } qw(¬O ·| ´¿ ±N ¯à ¦³);
my %stop = (
    %key, map { $_ => 1 } qw(
	ªº ¸Ì ©M »P ¤Î ¦~ ¤ë ¤é ®É ¤À ¬í ¥i ¹ï ©ó ¦ý ¤] ¥B ©Î ¤¤ ¦Ó ¬° ¥s
    )
);

my $is_utf8;
sub import {
    my $class = shift;

    if ($_[0] eq 'utf8' and !$is_utf8++) {
	Lingua::ZH::Toke->import(@_);

	require Encode;
	%key   = map { Encode::decode( big5 => $_ ) => 1 } keys %key;
	%stop  = map { Encode::decode( big5 => $_ ) => 1 } keys %stop;
	%punct = map {
	    Encode::decode( big5 => $_ ) =>
	    Encode::decode( big5 => $punct{$_} )
	} keys %punct;
    }
}

sub summarize {
    my ($text, %options) = @_;

    # Run each filter over the text.
    return unless $text;

    # Strip whitespace and formatting out of the text.
    $text =~ s/^\s+//;
    $text =~ s/\s+/ /sg;
    $text =~ s/\s+$//;

    unless (exists $options{maxlength} and $options{maxlength} > 0) {
	$options{maxlength} = log(length($text)) * 30;
    }

    ### Here's where the interesting logic happens.

    my $sentence = Lingua::ZH::Toke->new($text);

    # First we break it into sentence pieces. Kind of. Sort of.

    my $summary = '';
    my $flag = 1;

    <$sentence> unless $sentence->[0][0];

    while (my ($chunk, $punct) = (scalar <$sentence>, scalar <$sentence>)) {
	($flag = $punct{$punct}, next) unless $flag;
	$flag = $punct{$punct} or next;

	next unless length($chunk) > 10;
	next unless $chunk =~ /.+(?:\Q¬O\E|\Q·|\E|\Q´¿\E|\Q±N\E|\Q¯à\E|\Q¦³\E)/;
	next if $stop{substr($chunk, 0, 2)} or $stop{substr($chunk, -2)};

	$summary .= $chunk . $punct{$punct};

	last if length($summary) >= $options{maxlength};
    }

    ### Done! Do any necessary postprocessing before returning.

    return $summary unless $options{wrap};

    # Prettyprint the summary to make it look nice on a terminal, if requested.

    require Lingua::ZH::Wrap;

    $summary = Encode::encode(big5 => $summary) if $is_utf8;
    $summary = Lingua::ZH::Wrap::wrap(
	$summary, $options{wrap} || 72, 1
    );
    $summary = Encode::decode(big5 => $summary) if $is_utf8;

    return $summary;
}

1;