Lingua::ZH::Segment - Chinese Text Segmentation


Lingua-ZH-Segment documentation Contained in the Lingua-ZH-Segment distribution.

NAME

Lingua::ZH::Segment - Chinese Text Segmentation

VERSION

This document describes version 0.01 of Lingua::ZH::Segment, released March 10, 2005.

SYNOPSIS

    use Lingua::ZH::Segment;

    print segment('降龍18掌'); # 降 龍 18 掌




DESCRIPTION

This module currently only breaks Chinese text into single characters, treating each character as a word. Runs of Latin letters and digits are kept together rather than split.
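
The splitting rule described above can be sketched on an already-decoded character string. This is only an illustration: `split_chars` is a hypothetical helper, not part of the module, and it skips the encoding detection that the module performs.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is a character string, not bytes

# Hypothetical helper, not part of Lingua::ZH::Segment.
# Expects an already-decoded character string.
sub split_chars {
    my ($text) = @_;

    # Keep runs of Latin letters and digits whole; every other
    # non-whitespace character becomes its own token.
    my @tokens = $text =~ /([A-Za-z\d]+|\S)/g;

    return join ' ', @tokens;
}

binmode STDOUT, ':encoding(UTF-8)';
print split_chars('降龍18掌'), "\n";    # 降 龍 18 掌
```

Matching tokens directly, instead of splitting and then collapsing spaces, is a design choice of this sketch; it yields the same single-space-separated result for the input shown.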

METHODS

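Currently, only segment is available, and it is exported by default. It takes a single string, guesses whether it is encoded as UTF-8 or Big5 (see Encode::Guess), and returns the text in the same encoding with a single space between segments.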
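
The module source calls `guess_encoding` from Encode::Guess, which returns an error string instead of a decoder object when it cannot settle on one encoding. A caller who prefers an undefined result over an exception might wrap the same steps as in the sketch below. The name `segment_or_undef` is hypothetical and not part of the module.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical wrapper, not part of Lingua::ZH::Segment.
# Returns the segmented byte string, or undef if the encoding
# cannot be guessed.
sub segment_or_undef {
    my ($bytes) = @_;

    # Try UTF-8 and Big5 in addition to Encode::Guess's default suspects.
    my $decoder = guess_encoding($bytes, qw/utf8 big5/);

    # On failure guess_encoding returns an error message, not an object.
    return unless ref $decoder;

    my $text   = $decoder->decode($bytes);
    my @tokens = $text =~ /([A-Za-z\d]+|\S)/g;

    # Hand back bytes in the encoding that was detected.
    return $decoder->encode(join ' ', @tokens);
}

my $result = segment_or_undef("abc def");
print defined $result ? "$result\n" : "encoding could not be guessed\n";
```

Whether a given input is guessed unambiguously depends on its bytes, so callers should always check for the undefined case.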

SEE ALSO

Encode::Guess

AUTHORS

Cheng-Lung Sung <clsung@tw.freebsd.org>

KUDOS

Hsin-Chan Chien for inspiring me about Encode::Guess.

COPYRIGHT


# $Id: Segment.pm 1211 2005-03-10 14:10:14Z clsung $

package Lingua::ZH::Segment;
use strict;

use Exporter;        # loaded explicitly because @ISA refers to it below
use Encode::Guess;
our @ISA    = qw(Exporter);
our @EXPORT = qw(segment);
our $VERSION	= '0.02';

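sub segment {
    my $word = shift;

    # Guess the input encoding; on failure guess_encoding returns an error string.
    my $decoder = guess_encoding($word, qw/ utf8 big5 /);
    die "Cannot guess encoding: $decoder" unless ref $decoder;
    $word = $decoder->decode($word);

    # Keep runs of Latin letters and digits whole; split every other
    # non-whitespace character into its own segment.
    my @segs = split /([A-Za-z\d]+|\S)/, $word;
    $word = join " ", @segs;
    $word =~ s/\s{2,}/ /g;      # collapse the extra spaces left by split
    $word =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace

    # Return bytes in the same encoding as the input.
    $word = $decoder->encode($word);
    return $word;
}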

sub CLONE { }
sub DESTROY { }

1;