| Lingua-ZH-Segment documentation | Contained in the Lingua-ZH-Segment distribution. |
Lingua::ZH::Segment - Chinese Text Segmentation
This document describes version 0.01 of Lingua::ZH::Segment, released March 10, 2005.
use Lingua::ZH::Segment;
print segment('降龍18掌'); # 降 龍 18 掌
This module currently only breaks Chinese text into single characters, treating each character as a word; it does not break up runs of Latin letters or digits.
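As a rough illustration of the behaviour described above, the following sketch applies the same kind of split to an already-decoded string. Note that split_chars is a hypothetical helper, not part of the module, and it skips the encoding detection that segment performs.

```perl
use strict;
use warnings;
use utf8;                              # the string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (an assumption for illustration, not the module's API):
# every non-space character becomes its own token, except that runs of
# Latin letters and digits are kept together.
sub split_chars {
    my ($text) = @_;
    my @tokens = $text =~ /([A-Za-z\d]+|\S)/g;
    return join ' ', @tokens;
}

print split_chars('降龍18掌'), "\n";     # 降 龍 18 掌
print split_chars('Perl 5 很好'), "\n";  # Perl 5 很 好
```

Whitespace in the input only separates tokens and is never emitted as a token of its own, so the output always uses single spaces.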
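Currently, only the segment function is available; it is exported by default. It guesses whether its input is UTF-8 or Big5 using Encode::Guess and returns the segmented text in that same encoding.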
Cheng-Lung Sung <clsung@tw.freebsd.org>
Thanks to Hsin-Chan Chien for the inspiration to use Encode::Guess.
Copyright 2005 by Cheng-Lung Sung <clsung@tw.freebsd.org>
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
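# $Id: Segment.pm 1211 2005-03-10 14:10:14Z clsung $
package Lingua::ZH::Segment;

use strict;
require Exporter;
use Encode::Guess;

our @ISA     = qw(Exporter);
our @EXPORT  = qw(segment);
our $VERSION = '0.02';

sub segment {
    my $word = shift;

    # guess_encoding returns an error string, not an object, on failure.
    my $decoder = guess_encoding($word, qw/ utf8 big5 /);
    die "Cannot guess encoding: $decoder" unless ref $decoder;
    $word = $decoder->decode($word);

    # Keep runs of Latin letters and digits together; every other
    # non-space character becomes its own segment.
    my @segs = split /([A-Za-z\d]+|\S)/, $word;
    $word = join " ", @segs;

    # Collapse repeated whitespace and trim both ends.
    $word =~ s/\s{2,}/ /g;
    $word =~ s/(^\s|\s$)//g;

    # Return the result in the encoding the input arrived in.
    $word = $decoder->encode($word);
    return $word;
}

sub CLONE   { }
sub DESTROY { }

1;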
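The decode and encode steps in segment depend on Encode::Guess, whose guess_encoding returns an encoding object on success and a plain error string otherwise. The sketch below shows one way to guard that call and round-trip the bytes. The helper name decode_guessed is an assumption made for illustration and is not part of the module.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Hypothetical helper (not the module's API): decode bytes that are either
# UTF-8 or Big5, and hand back the encoding object for re-encoding later.
sub decode_guessed {
    my ($bytes) = @_;
    my $decoder = guess_encoding($bytes, qw/utf8 big5/);
    # On failure $decoder is a message string rather than an object.
    die "Cannot guess encoding: $decoder\n" unless ref $decoder;
    return ($decoder->decode($bytes), $decoder);
}

# UTF-8 bytes for the synopsis string (U+964D U+9F8D "18" U+638C).
my $bytes = encode('UTF-8', "\x{964D}\x{9F8D}18\x{638C}");
my ($text, $decoder) = decode_guessed($bytes);

print "guessed encoding: ", $decoder->name, "\n";
print "characters: ", length($text), "\n";
print "round trip ok\n" if $decoder->encode($text) eq $bytes;
```

Short or purely ASCII inputs can be valid in more than one candidate encoding, so callers may want to catch the failure with eval rather than let it propagate.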