Lingua::ZH::CEDICT::Textfile - Interface for cedict.b5


Lingua-ZH-CEDICT documentation Contained in the Lingua-ZH-CEDICT distribution.

NAME

Lingua::ZH::CEDICT::Textfile - Interface for cedict.b5

SYNOPSIS

  use Lingua::ZH::CEDICT;

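  # source is required. In this version filename defaults to "cedict_ts.u8";
  # charset conversion only happens when sourceCharset and targetCharset
  # are both set and differ, as in this Big5 example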
  $dict = Lingua::ZH::CEDICT->new(source        => "Textfile",
                                  filename      => "cedict.b5",
                                  sourceCharset => "Big5",
                                  targetCharset => "UTF-8");

  # read the file
  $dict->init();

DESCRIPTION

This module imports CEDICT from a file, e.g. from the original cedict.b5. It will attempt to do a charset conversion if sourceCharset and targetCharset have a true value and differ.
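To illustrate the kind of line the importer expects, here is a minimal sketch that applies the same pattern used by init() in the source further down to a single sample line. The helper name parse_cedict_line and the hash keys are hypothetical and not part of the module. Reading the first two fields as traditional and simplified forms is an assumption based on the cedict_ts.u8 layout, and the sample line is made up for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of Lingua::ZH::CEDICT::Textfile):
# parse one dictionary line with the pattern that init() uses.
sub parse_cedict_line {
    my ($line) = @_;

    # Fields: word, word, [pinyin with tone numbers], /definition/definition/
    my ($first, $second, $pinyin, $defs) =
        $line =~ m|^(\S+)\s(\S+)\s\[([a-z0-9: ]+)\]\s/(.*)/\s*$|
        or return;

    return {
        traditional => $first,     # assumption: first field is traditional
        simplified  => $second,    # assumption: second field is simplified
        pinyin      => $pinyin,
        definitions => [ split m{/}, $defs ],
    };
}

# Illustrative sample line, not taken from a real dictionary file.
my $entry = parse_cedict_line("中國 中国 [zhong1 guo2] /China/Middle Kingdom/\n");
```

Note that the pinyin character class only admits lowercase letters, digits, colons and spaces, so a line whose pinyin contains anything else is rejected; init() dies on such a line, whereas this sketch simply returns nothing.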

METHODS

There are a number of methods you might find useful to work with the data once it is in memory. They are included and described in Lingua::ZH::CEDICT, just in case you want to use them with one of the other interface modules as well.

PREREQUISITES

Lingua::ZH::CEDICT.

If you are doing charset conversions (e.g. Big5 to UTF-8), you will need Text::Iconv.

AUTHOR

Christian Renz, <crenz@web42.com>

LICENSE

Copyright (C) 2002-2005 Christian Renz. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Lingua::ZH::CEDICT. Text::Iconv.


package Lingua::ZH::CEDICT::Textfile;

# Copyright (c) 2002-2005 Christian Renz <crenz@web42.com>
# This module is free software; you can redistribute it and/or modify
# it under the same terms as Perl itself.

# $Id: Textfile.pm,v 1.3 2002/08/13 20:57:45 crenz Exp $

use bytes;
use strict;
use warnings;
use vars qw($VERSION @ISA);

$VERSION = '0.03';
@ISA = qw(Lingua::ZH::CEDICT);

sub new {
    my $class = shift;
    my $self = +{@_};

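    # Key names must match the ones init() reads (sourceCharset/targetCharset);
    # the previous spelling "CharSet" meant these defaults were never used.
    $self->{filename}      ||= "cedict_ts.u8";
    $self->{sourceCharset} ||= "UTF-8";
    $self->{targetCharset} ||= "UTF-8";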

    bless $self, $class;
}

sub init {
    my ($self) = @_;
    my $fh;

    my $iconv;
    if ($self->{sourceCharset} && $self->{targetCharset} &&
        ($self->{sourceCharset} ne $self->{targetCharset})) {
        require Text::Iconv;
        Text::Iconv->raise_error(1);
        $iconv = Text::Iconv->new($self->{sourceCharset}, $self->{targetCharset});
    }

    $self->{entry} = [];
    # Three-argument open, so the filename is never interpreted as a mode
    open $fh, "<", $self->{filename}
        or die "Can't open $self->{filename}: $!\n";
    $self->{version} = <$fh>;
    while (<$fh>) {
        next unless /\w/;
        $_ = $iconv->convert($_) if defined $iconv;

        m|^(\S+)\s(\S+)\s\[([a-z0-9: ]+)\]\s/(.*)/\s*$| or
#        m|^(\S+\|\S*)\s\[([a-z0-9: ]+)\]\s/(.*)/\s*$| or
            die "Line $.: Invalid entry '$_'\n";
#        my @zi = split /\|/, $1;
#        $zi[1] ||= '';
#        $zi[1] = '' if (index($zi[1], '?') >= 0);
        my $tonelessPinyin = $self->removePinyinTones($3);
#        print "@zi // $2 // $3\n";
        push @{$self->{entry}}, [ $1, $2, $3, $tonelessPinyin, $4 ];
#        push @{$self->{entry}}, [ $zi[0], $zi[1], $2, $tonelessPinyin, $3 ];
    }
    close $fh;

    $self->{numEntries} = scalar @{$self->{entry}};
}

1;

__END__