NAME
Lingua::ZH::CCDICT - An interface to the CCDICT Chinese dictionary
SYNOPSIS
use Lingua::ZH::CCDICT;
my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory' );
DESCRIPTION
This module provides a Perl interface to the CCDICT dictionary created by Thomas Chin. This dictionary is indexed by Unicode character number (traditional character), and contains information about these characters.
CCDICT is released under a Creative Commons Attribution License (version 2.5).
The dictionary contains the following information, though not all information is available for every character.
In addition, the dictionary contains English definitions (often multiple definitions), and romanizations for the character in different languages and systems. The romanizations available include Pinjim for Hakka, Jyutping for Cantonese and (Hanyu) Pinyin for Mandarin.
DICTIONARY BUGS
The CCDICT dictionary is distributed by Thomas Chin in a simple but non-standard, textual ASCII-only format. I've tried to work around errors or ambiguities in the dictionary data, although there are probably still oddities lurking. Please send bug reports to me so I can figure out whether the error is in my code or the dictionary itself.
STORAGE
This module is capable of parsing the CCDICT format file, and can also store the data in other formats (just Berkeley DB for now).
Each storage system is implemented via a module in the `Lingua::ZH::CCDICT::Storage::*' class hierarchy. All of these modules are subclasses of the `Lingua::ZH::CCDICT' class, and implement its methods for searching the dictionary.
Some storage classes may also offer additional methods.
Storage Subclasses
The following storage subclasses are available:
USAGE
This module allows you to look up information in the dictionary based on a number of keys. These include the Unicode character (as a character, not its number), stroke count, radical number, and any of the various romanization systems.
METHODS
This class provides the following methods.
Lingua::ZH::CCDICT->new(...)
This method always takes at least one parameter, "storage". This
indicates what storage subclass to use. The current options are
"InMemory" and "BerkeleyDB".
Any other parameters given will be passed to the appropriate subclass's `new()' method.
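The dispatch described above (a "storage" name selecting a subclass, with the remaining parameters handed to that subclass's constructor) can be sketched roughly as follows. This is only an illustration of the pattern, not the module's actual code: the `Demo::Dict' packages are hypothetical stand-ins invented for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class; NOT the real Lingua::ZH::CCDICT.
package Demo::Dict;

sub new {
    my ( $class, %p ) = @_;

    # The "storage" parameter is mandatory and is consumed here.
    my $storage = delete $p{storage}
        or die "The 'storage' parameter is required\n";

    # Map the storage name onto a class in the Storage:: hierarchy.
    my $subclass = "Demo::Dict::Storage::$storage";
    die "Unknown storage type: $storage\n"
        unless $subclass->can('new');

    # Every remaining parameter is passed to the subclass constructor.
    return $subclass->new(%p);
}

# Hypothetical storage subclass that simply keeps its parameters.
package Demo::Dict::Storage::InMemory;
our @ISA = ('Demo::Dict');

sub new {
    my ( $class, %p ) = @_;
    return bless {%p}, $class;
}

package main;

my $dict = Demo::Dict->new( storage => 'InMemory' );
print ref($dict), "\n";
```

Because the returned object is an instance of the storage subclass, which in turn inherits from the base class, callers can use the same search methods regardless of which storage was chosen.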
$dict->parse_source_file($filename)
If you don't specify a file, then it will use the data file distributed
with this module. This is probably what you want, unless you have a
local copy of the dictionary that you want to work with. Note that the
dictionary format has changed a fair bit between versions, so this
probably won't work with much older or newer versions of the CCDICT
data.
This method is what does the real work of creating a dictionary. Note that if you are not using the InMemory storage subclass, you only need to parse the source file once, and then you can reuse the stored data.
MATCH METHODS
When doing a lookup based on the romanization of a character, indicate the tone with a number at the end of the syllable rather than with a Unicode character that combines the Latin letter with a diacritic.
In addition, lookups based on a Pinyin romanization should use the u-with-umlaut character (character 252 in Unicode) rather than two "u" characters.
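As a minimal sketch of these two conventions, the helper below turns a Pinyin syllable typed in plain ASCII into a lookup string with a trailing tone number and a u-with-umlaut. The function `pinyin_lookup_key' is hypothetical and is not part of this module; it only assumes that the ASCII input spells the umlaut as two "u" characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Lingua::ZH::CCDICT): normalize an
# ASCII-typed Pinyin syllable into the form the match methods expect.
sub pinyin_lookup_key {
    my ($syllable) = @_;

    # The tone must be a single digit at the end, e.g. "ma1" or "nuu3".
    die "no trailing tone number in '$syllable'\n"
        unless $syllable =~ /\A\p{L}+[0-9]\z/;

    # Replace the two-"u" spelling with u-with-umlaut (character 252).
    ( my $key = lc $syllable ) =~ s/uu/\x{FC}/g;

    return $key;
}

my $key = pinyin_lookup_key('nuu3');

# Print the code points of the key, separated by dots.
printf "%vd\n", $key;
```

The resulting string could then be passed to a call such as `$ccdict->match_pinyin($key)', assuming `$ccdict' is an already-constructed dictionary object.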
The return value for any lookup will be an object in a `Lingua::ZH::CCDICT::ResultSet' subclass.
Result sets always return matches in ascending Unicode character order.
$ccdict->match_unicode(@chars)
This method matches on one or more Unicode characters. Unicode
characters should be given as Perl characters (i.e. `chr(0x7D20)'), not
as a number.
This dictionary index uses traditional Chinese characters. Simplified character lookups will not work (but you could use `Encode::HanConvert' to convert simplified to traditional first).
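The distinction between a Perl character and a number (or raw bytes) can be illustrated with a short sketch. It uses only core Perl and the standard `Encode' module; the variable names are made up for this example, and the final lookup is shown as a comment because it assumes a `$ccdict' object that has already been constructed.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Two equivalent ways to build the single Perl character U+7D20.
my $from_chr    = chr(0x7D20);
my $from_escape = "\x{7D20}";

# Raw UTF-8 bytes (as read from a file or terminal without a decoding
# layer) are three separate bytes and must be decoded into a character.
my $utf8_bytes   = "\xE7\xB4\xA0";
my $from_decoded = decode( 'UTF-8', $utf8_bytes );

printf "U+%04X, length %d\n", ord($from_decoded), length($from_decoded);

# Any of the character forms can then be passed to the lookup, assuming
# $ccdict already holds a dictionary object:
#   my $results = $ccdict->match_unicode($from_chr);
```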
$ccdict->match_radical(@numbers)
Given a set of numbers, this method returns those characters containing
the specified radical(s).
$ccdict->match_index(@numbers)
Given a set of numbers, this method returns those characters containing
the specified index(es).
$ccdict->match_stroke_count(@numbers)
Given a set of numbers, this method returns those characters containing
the specified number(s) of strokes.
$ccdict->match_cangjie(@codes)
Given a set of Cangjie codes, this method returns the character(s) for
those code(s).
$ccdict->match_four_corner(@codes)
Given a set of Four Corner codes, this method returns the character(s)
for those code(s).
$ccdict->match_pinjim(@romanizations)
$ccdict->match_jyutping(@romanizations)
$ccdict->match_pinyin(@romanizations)
$ccdict->all_characters()
Returns a result set containing all of the characters in the dictionary.
$ccdict->entry_count()
Returns the number of entries in the dictionary.
ENVIRONMENT VARIABLES
There are several environment variables you can set to change this module's behavior.
AUTHOR
David Rolsky <autarch@urth.org>
COPYRIGHT
Copyright (c) 2002-2007 David Rolsky. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
CCDICT is copyright (c) 1995-2006 Thomas Chin.
SEE ALSO
Lingua::ZH::CEDICT - for converting between Chinese and English.
Encode::HanConvert - for converting between simplified and traditional characters in various character sets.
http://www.chinalanguage.com/dictionaries/CCDICT/ - the home of the CCDICT dictionary.