| Lingua-JA-FindDates documentation | Contained in the Lingua-JA-FindDates distribution. |
Lingua::JA::FindDates - scan text to find Japanese dates
To find and replace Japanese dates in a string,
use Lingua::JA::FindDates 'subsjdate'; # Given a string, find and substitute all the Japanese dates in it. my $dates = '昭和41年三月16日'; print subsjdate ($dates);
prints
March 16, 1966
Find and substitute Japanese dates within a string:
my $dates = 'blah blah blah 三月16日'; print subsjdate ($dates);
prints
blah blah blah March 16
subsjdate can also call back a user-supplied routine each
time a date is found:
sub replace_callback
{
my ($data, $before, $after) = @_;
print "$before was replaced by $after\n";
}
my $dates = '三月16日';
my $data = 'xyz'; # something to send to replace_callback
subsjdate ($dates, {replace => \&replace_callback, data => $data});
prints
三月16日 was replaced by March 16
You can also use a routine to format the date any way, letting
subsjdate print it for you:
sub my_date
{
my ($date) = @_;
return join '/', $date->{month}."/".$date->{date};
}
my $dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date});
This prints
3/16
This module uses a set of regular expressions to detect Japanese-style dates in a string. It recognizes typical Japanese year/month/day-style dates such as 平成20年七月十日 Heisei nijuunen shichigatsu tooka. It also recognizes combinations such as years alone, years and months, a month and day without a year, fiscal years, parts of the month like 中旬 (chuujun, the middle of the month), and periods between two dates.
This module has been road-tested on hundreds of documents, and it can cope with virtually any kind of common Japanese date. If you find that it cannot identify some kind of date within Japanese text, please report that as a bug.
If you would like to see more examples of how this module works, look
at the testing code in t/Lingua-JA-FindDates.t.
This module exports one function, subsjdate, on request.
| Lingua-JA-FindDates documentation | Contained in the Lingua-JA-FindDates distribution. |
package Lingua::JA::FindDates; use 5.008000; require Exporter; use AutoLoader qw(AUTOLOAD); our @ISA = qw(Exporter); @EXPORT_OK= qw/subsjdate/; our $VERSION = '0.011'; use warnings; use strict; use Carp; use utf8; my %kanjinums = ( ä¸ => 1, äº => 2, ä¸ => 3, å => 4, äº => 5, å => 6, ä¸ => 7, å « => 8, ä¹ => 9, å => 10, ç¾ => 100, å => 1000, # Dates shouldn't get any bigger than this X a digit ); my $kanjidigits = join ('', keys %kanjinums); =head2 kanji2number =over =item kanji2number ($knum) C<kanji2number> is a very simple kanji number convertor. Its input is one string of kanji numbers only, like 'ä¸åä¸'. It can deal with kanji numbers with or without ten/hundred/thousand kanjis. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number. This function is not exported. =back =head3 Bugs kanji2number only goes up to thousands, because usually dates only go that far. If you need a comprehensive Japanese number convertor, we recommend using L<Lingua::JA::Numbers> instead of this. Also, it doesn't deal with mixed kanji and arabic numbers. =cut sub kanji2number { my ($knum) = @_; return 1 if $knum eq 'å '; my @kanjis = split '', $knum; my $value = 0; my $keta = 1; my @values; while (1) { my $k = pop @kanjis; return $value if !$k; my $val = $kanjinums{$k}; if (!$val) { carp "can't cope with '$k' of input '$knum'.\n"; return 0; } if ($val >= 10) { $keta = $val; my $knext = pop @kanjis; if (!$knext) { return $value + $val; } my $val_next = $kanjinums{$knext}; if (!$val_next) { carp "can't cope with '$knext' of input '$knum'.\n"; return 0; } if ($val_next > 10) { push @kanjis, $knext; $value += $val; } else { $value += $val_next * $val; } } else { $value += $val * $keta; $keta *= 10; } } } # Map double-byte numbers to single byte numbers. my $nums = 'ï¼ï¼ï¼ï¼ï¼ï¼ï¼ï¼ï¼ï¼'; my @wnums = split '', $nums; my %wtonarrow; for (0..9) {$wtonarrow{$wnums[$_]} = $_} my $jdigit = '[ï¼-ï¼0-9]'; # Japanese number my $jnumber = "($jdigit+|[$kanjidigits]+)"; # Western year my $wyear = '('.$jdigit.'{4}|['.$kanjidigits.']?å['.$kanjidigits.']*|'. '[\']'.$jdigit.'{2}'. ')\s*å¹´'; # Japanese eras (Heisei, Showa, Taisho, Meiji). Japanese people # sometimes write these eras using the letters H, S, T, and M. my $jera = '(H|H|å¹³æ|S|ï¼³|æå|T|ï¼´|大æ£|M|ï¼|ææ²»)'; # Map of Japanese eras to Western dates. my %jera2w = ( H => 1988, H => 1988, å¹³æ => 1988, S => 1925, ï¼³ => 1925, æå => 1925, T => 1911, ï¼´ => 1911, å¤§æ£ => 1911, M => 1869, ï¼ => 1869, ææ²» => 1869, ); # Japanese year, with era like "Heisei" at the beginning. my $jyear = $jera.'\h*('."$jdigit+|[$kanjidigits]+".'|å )\h*å¹´'; # Ten day periods (thirds of a month) my %jun = qw/å 1 ä¸ 1 ä¸ 2 ä¸ 3/; my @jun2english = ('invalid', 'early ', 'mid-', 'late '); # Japanese days of the week, from Monday to Sunday. my $weekdays = 'æç«æ°´æ¨é忥'; my @weekdays = split '',$weekdays; # Match a string for a weekday, like æææ¥ or (æ¥) # The long part (?=\W) is to stop it from accidentally matching a # kanji which is part of a different word, like the following: #å¹³æäºå年䏿䏿¥ # æ¥æ¬è«ã»æ¥æ¬äººè«ã¯é常ã«é¢ç½ããã®ã ã my $match_weekday = '[ï¼(]?(['.$weekdays.'])'. '(?:(?:(?:ææ¥|æ)[)\ï¼])|[)\ï¼]|(?=\W))'; # my $match_weekday = '[ï¼(]?(['.$weekdays.'])(?:ææ¥|æ)?[)ï¼]?'; # Match a day of the month, like 10æ¥ my $match_dom = $jnumber.'\h*æ¥'; # Match a month my $match_month = $jnumber.'\h*æ'; # Match jun my $match_jun = '(['.join ('', keys %jun).'])\h*æ¬'; # Match a month+jun my $match_month_jun = $match_month.'\h*'.$match_jun; # Match a month and day of month pair my $match_month_day = $match_month.'\h*'.$match_dom; # Match a Japanese year, month, day string my $matchymd = $jyear.'\h*'.$match_month_day; # Match a Western year, month, day string my $matchwymd = $wyear.'\h*'.$match_month_day; # Match a Japanese year and month only my $match_jyear_month = $jyear.'\h*'.$match_month; # Match a Western year and month only my $match_wyear_month = $wyear.'\h*'.$match_month; # Match a month, day, weekday. my $match_month_day_weekday = $match_month_day.'\h*'.$match_weekday; # Separators used in date strings # Microsoft Word uses Unicode 0xFF5E, the "fullwidth tilde", for nyoro symbol. my $separators = '\h*[ãâ~]\h*'; # # =head2 Matching patterns # I<The module can be used without reading this section>. # The Japanese date regular expressions are stored in an array # B<jdatere> containing a pair of a regular expression to match a kind # of date, and a string like "ymdw" which contains letters saying what # to do with $1, $2, etc. from the regular expression. The array # B<jdatere> is ordered from longest match (like "year / month / day / # weekday") to shortest (like "year" only). For example, if the first # letter is "y", then $1 is a year in Western format like 2008, or if # the third letter is "w", then $3 is the day of the week, from 1 to 7. # =over # =item e # Japanese era (string). # =item j # Japanese year (string representing small number) # =item x # empty month and day # =item m # month number (from 1 to 12, 13 for a blank month, 0 for an invalid month) # =item d # day of month (from 1 to 31, 32 for a blank day, 0 for an invalid day) # =item w # weekday (from Monday = 1 to Sunday = 7, zero or undefined for an # invalid weekday) # =item z # jun (æ¬), a ten day period. # =item 1 # After another code, indicates the first of a pair of two things. For # example, the matching code for # å¹³æï¼å¹´ï¼ï¼æï¼ï¼æ¥ãï¼ï¼æ¥ # is # ejmd1d2 # =back # =cut my @jdatere = ( # Match an empty string like å¹³æ æ æ¥ as found on a form etc. [$jyear.'(\h+)æ\h+æ¥', "ejx"], # Add match for dummy strings here # Match a Japanese era, year, 2 x (month day weekday) combination [$matchymd.'\h*'.$match_weekday.$separators. $match_month_day_weekday, "ejm1d1w1m2d2w2"], # Match a Japanese era, year, month 2 x (day, weekday) combination [$matchymd.$match_weekday.$separators.$match_dom.'\h*'.$match_weekday, "ejmd1w1d2w2"], # Match a Japanese era, year, month 2 x day combination [$matchymd.$separators.$match_dom.'\h*'.$match_weekday, "ejmd1d2"], # Match a Western year, 2x(month, day, weekday) combination [$matchwymd.'\h*'.$match_weekday.$separators.$match_month_day_weekday, "ym1d1w1m2d2w2"], # Match a Western year, month, 2x(day, weekday) combination [$matchwymd.'\h*'.$match_weekday.$separators.$match_dom.'\h*'.$match_weekday, "ymd1w1d2w2"], # Match a Western year, month, 2x(day) combination [$matchwymd.$separators.$match_dom, "ymd1d2"], # Match a Japanese era, year, month1 day1 - month 2 day2 combination [$matchymd.$separators.$match_month_day, "ejm1d1m2d2"], # Match a Japanese era, year, month1 - month 2 combination [$jyear.'\h*'.$jnumber.'\h*æ?'.$separators.$match_month, "ejm1m2"], # Match a Japanese era, year, month, day1 - day2 combination [$match_jyear_month.'\h*'.$jnumber.'\h*æ¥?'.$separators.$match_dom, "ejmd1d2"], # Match a Japanese era, year, month, day, weekday combination [$matchymd.'\h*'.$match_weekday , "ejmdw"], # Match a Japanese era, year, month, day [$matchymd , "ejmd"], # Match a Japanese era, year, month, jun [$match_jyear_month.'\h*'.$match_jun , "ejmz"], # Match a Western year, month, day, weekday combination [$matchwymd.'\h*'.$match_weekday , "ymdw"], # Match a Western year, month, day combination [$matchwymd , "ymd"], # Match a Western year, month, jun combination [$match_wyear_month.'\h*'.$match_jun , "ymz"], # Match a Japanese era, year, month [$jyear.'\h*'.$jnumber.'\h*æ' , "ejm"], # Match a Western year, month [$match_wyear_month , "ym"], # Match 2 x (month, day, weekday) [$match_month_day_weekday.$separators.$match_month_day_weekday, "m1d1w1m2d2w2"], # Match month, 2 x (day, weekday) [$match_month_day_weekday.$separators.$match_dom.'\h*'.$match_weekday, "md1w1d2w2"], # Match month, 2 x (day, weekday) [$match_month_day.$separators.$match_dom, "md1d2"], # Match a month, day, weekday [$match_month_day_weekday , "mdw"], # Match a month, day [$match_month_day , "md"], # Match a fiscal year (年度, nendo in Japanese). These usually don't # have months combined with them, so there is nothing to match a # fiscal year with a month. [$jyear.'度' , "en"], # Match a fiscal year (年度, nendo in Japanese). These usually don't # have months combined with them, so there is nothing to match a # fiscal year with a month. [$wyear.'度' , "n"], # Match a Japanese era, year [$jyear, "ej"], # Match a Western year [$wyear , "y"], # Match a month with a jun [$match_month.'\h*'.$match_jun , "mz"], # Match a month [$match_month , "m"], ); my @months = qw/Invalid January February March April May June July August September October November December MM/; my @days = qw/Invalid Monday Tuesday Wednesday Thursday Friday Saturday Sunday/; my %j2eweekday; @j2eweekday{@weekdays} = (1..7); =head2 make_date =over =item make_date ($date) C<make_date> is the default date-string-making routine. It turns the date information supplied to it into a string representing the date. C<make_date> is not exported. L<subsjdate>, given a date like å¹³æï¼ï¼å¹´ï¼æï¼æ¥ï¼æ¨ï¼ (Heisei year 20, month 7, day 3, in other words "Thursday the third of July, 2008"), passes C<make_date> a hash reference with values (year =>2008, month => 7, date => 3, wday => 4) for the year, month, date and day of the week. C<make_date> returns a string, 'Thursday, July 3, 2008'. If some fields of the date aren't defined, for example in the case of a date like ï¼æï¼æ¥ (3rd July), the hash values for the keys of the unknown parts of the date, such as year or weekday, will be undefined. To replace the default routine C<make_date> with a different format, supply a C<make_date> callback to L<subsjdate>: sub my_date { my ($date) = @_; return join '/', $date->{month}."/".$date->{date}; } my $dates = '䏿ï¼ï¼æ¥'; print subsjdate ($dates, {make_date => \&my_date}); This prints 3/16 =back =cut sub make_date { my ($datehash) = @_; my ($year, $month, $date, $wday, $jun) = @{$datehash}{qw/year month date wday jun/}; if (!$year && !$month && !$date && !$jun) { carp "No valid inputs\n"; return; } my $edate = ''; $edate = $days[$wday].", " if $wday; if ($month) { $month = int ($month); # In case it is 07 etc. $edate .= $months[$month]; if ($jun) { $edate = $jun2english[$jun] . $edate; } } if ($date) { $edate .= " " if length ($edate); $date = int ($date); # In case it is 07 etc. $date = "DD" if $date == 32; if ($year) { $edate .= "$date, $year"; } else { $edate .= "$date"; } } elsif ($year) { $edate .= " " if length ($edate); $edate .= $year; } return $edate; } =head2 make_date_interval This function is called when an interval of two dates, such as å¹³æï¼å¹´ ï¼æï¼æ¥ãï¼æ¥, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as L<make_date>. This function is not exported. It is the default used by C<subsjdate>. You can use another function instead of this default by supplying a value C<make_date_interval> as a callback in L<subsjdate>. =cut sub make_date_interval { my ($date1, $date2) = @_; my $einterval = ''; my $usecomma; # The case of an interval with different years doesn't need to be # considered, because each date in that case can be considered a # single date. if ($date2->{month}) { if (!$date1->{month}) { carp "end month but no starting month"; return; } } if ($date1->{month}) { if ($date1->{wday} && $date2->{wday}) { if (! $date1->{date} || ! $date2->{date}) { carp "malformed date has weekdays but not days of month"; return; } $usecomma = 1; $einterval = $days[$date1->{wday}] . " " . $date1->{date} . ($date2->{month} ? ' '.$months[int ($date1->{month})] : ''). '-' . $days[$date2->{wday}] . " " . $date2->{date} . " " . ($date2->{month} ? $months[int ($date2->{month})] : $months[int ($date1->{month})]); } elsif ($date1->{date} && $date2->{date}) { $usecomma = 1; if ($date1->{wday} || $date2->{wday}) { carp "malformed date interval: ", "has weekday for one date but not the other one."; return; } $einterval = $months[int ($date1->{month})] . ' ' . $date1->{date} . '-' . ($date2->{month} ? $months[int ($date2->{month})] . ' ' : '') . $date2->{date}; } else { # no dates or weekdays if ($date1->{date} || $date2->{date}) { carp "malformed date interval: only one day of month"; return; } if (!$date2->{month}) { carp "start month but no end month or date"; return; } $einterval = $months[int($date1->{month})] . '-' . $months[int($date2->{month})] . $einterval; } } else { # weekday - day / weekday - day case. if ($date1->{wday} && $date2->{wday}) { if (! $date1->{date} || ! $date2->{date}) { carp "malformed date has weekdays but not days of month"; return; } $einterval = $date1->{wday} . " " . $date1->{date} . '-' . $date2->{wday} . " " . $date2->{date}; } } $einterval .= ($usecomma ? ', ': ' ').$date1->{year} if $date1->{year}; return $einterval; } =head2 $verbose If you want to see what the module is doing, set $Lingua::JA::FindDates::verbose = 1; This makes L<subsjdate> print out each regular expression and reports whether it matched, which looks like this: Looking for y in ([ï¼-ï¼0-9]{4}|[åå ä¸ä¹äºååç¾äºä¸å «ä¸]?å[åå ä¸ä¹äºååç¾äºä¸å «ä¸]*)\h*å¹´ Found 'åä¹ç¾å åå å¹´': Arg 0: 1966 -> '1966' =cut our $verbose = 0; =head2 subsjdate =over =item subsjdate ($L<text>, $L<callbacks>) "subsjdate", given a string (argument 1) containing some text like å¹³ æï¼ï¼å¹´ï¼æï¼æ¥ï¼æ¨ï¼, looks through the string using a set of regular expressions, and if it finds anything, it calls L<make_date> to make the equivalent date in English, and then substitutes it into $text: C<$text =~ s/å¹³æï¼ï¼å¹´ï¼æï¼æ¥ï¼æ¨ï¼/Thursday, July 3, 2008/g>; Users can supply a different date making function. See below. =item text A string, encoded in Perl's internal encoding. =item callbacks The hash reference C<$callbacks> can take the following items: =over 2 =item replace If there is a replace value in the callbacks, subsjdate calls it as a subroutine with the data in C<$callbacks->{data}> and the before and after string. =item data Any data you want to pass to the L<replace> callback. =item make_date This is a replacement for the L<make_date> function. If you don't need to replace the default (if you want American-style dates), you can leave this blank. If, for example, you want dates in the form "Th 2008/7/3", you could write a routine like the following: sub mymakedate { my ($date) = @_; return qw{Bad Mo Tu We Th Fr Sa Su}[$date->{wday}]. $date->{year}.'/'.$date->{month}.'/'.$date->{date}; } Note that you need to check for the hash values for year, month, date, and wday being zero, since L<subsjdate> matches "month/day" and "year/month" only dates. =item make_date_interval This is a replacement for the L<make_date_interval> function. Its arguments are two dates. =back =back =head3 Bugs =over =item No sanity check of Japanese era dates It does not detect that dates like æåç¾å¹´ (Showa 100, an impossible year) are invalid. =item Only goes back to Meiji The date matching only goes back to the Meiji era. There is L<DateTime::Calendar::Japanese::Era> if you need to go back further. =item Doesn't find dates in order The dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Basically the longer forms of dates are searched for before the shorter ones. =item UTF-8 version only This module only understands Japanese encoded in Perl's internal form (UTF-8). =item Trips a bug in Perl 5.10 If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (L<http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902>). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it. =item Doesn't do å æ¥ (I<ganjitsu>) This date (another way to write "1st January") is a little difficult, since the characters which make it up could also occur in other contexts, like å æ¥æ¬è» I<gennihongun>, "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do. =back =cut sub subsjdate { # $text is the text to substitute. It needs to be in Perl's # internal encoding. # $replace_callback is a routine to call back if we find valid dates. # $data is arbitrary data to pass to the callback routine. my ($text, $callbacks) = @_; for my $datere (@jdatere) { my $regex = $$datere[0]; my @process = split (/(?=[a-z][12]?)/, $$datere[1]); print "Looking for ",$$datere[1]," in ",$regex,"\n" if $verbose; while ($text =~ /($regex)/g) { my $date1; my $date2; my $orig = $1; # print "Keys are ",$$datere[1],"\n"; my @matches = ($2,$3,$4,$5,$6,$7,$8,$9); # uh - oh. Be careful! print "Found '$orig': " if $verbose; for (0..$#matches) { my $arg = $matches[$_]; last if !$arg; $arg =~ s/([ï¼-ï¼])/$wtonarrow{$1}/g; $arg =~ s/([$kanjidigits]+|å )/kanji2number($1)/ge; print "Arg $_: $arg " if $verbose; my $argdo = $process[$_]; # print $argdo,"\n"; if ($argdo eq 'e') { # Era name in Japanese $date1->{year} = $jera2w{$arg}; } elsif ($argdo eq 'j') { # Japanese year $date1->{year} += $arg; } elsif ($argdo eq 'y') { $date1->{year} = $arg; } elsif ($argdo eq 'n') { $date1->{year} += $arg; $date1->{year} = "fiscal ".$date1->{year}; } elsif ($argdo eq 'm' || $argdo eq 'm1') { $date1->{month} = $arg; } elsif ($argdo eq 'd' || $argdo eq 'd1') { $date1->{date} = $arg; } elsif ($argdo eq 'm2') { $date2->{month} = $arg; } elsif ($argdo eq 'd2') { $date2->{date} = $arg; } elsif ($argdo eq 'w' || $argdo eq 'w1') { $date1->{wday} = $j2eweekday{$arg}; } elsif ($argdo eq 'w2') { # print "W2\n"; $date2->{wday} = $j2eweekday{$arg}; } elsif ($argdo eq 'z') { $date1->{jun} = $jun{$arg}; # print "\n*Jun of $arg is ",$date1->{jun},"\n"; } elsif ($argdo eq 'x') { print "Dummy date '$orig'.\n" if $verbose; $date1->{date} = 32; $date1->{month} = 13; } } my $edate; if ($date2) { if ($callbacks->{make_date_interval}) { $edate = &{$callbacks->{make_date_interval}} ($date1, $date2); } else { $edate = make_date_interval ($date1, $date2); } } else { if ($callbacks->{make_date}) { $edate = &{$callbacks->{make_date}}($date1); } else { $edate = make_date ($date1); } } print "-> '$edate'\n" if $verbose; $text =~ s/\Q$orig\E/$edate/g; if ($callbacks->{replace}) { &{$callbacks->{replace}}($callbacks->{data}, $orig, $edate); } } } return $text; } =head1 Author Ben Bullock, benkasminbullock@gmail.com =cut # =head1 History # This routine started life as a Visual Basic for Applications (VBA) # script (a "Word Macro") to automatically convert Japanese dates in a # Microsoft Word document into their equivalent English versions. See # http://linuxtnt.wordpress.com/2008/04/16/visual-basic-date-translator-updated/ # . Eventually, because I kept finding exceptions & I didn't know Visual # Basic well enough to code that efficiently, I decided to rewrite it # all in Perl, using L<Win32::OLE> to automate the operation of # Microsoft Word. (The Microsoft Word handlers are not included in this # module.) # The basic idea is to ask Word to save a copy of the file as text via # OLE, then read the text file in to Perl, look for dates in the text # using L<subsjdate>, and then call back into Microsoft Word using the # L<replace> callback argument to L<subsjdate> to substitute the # Japanese dates with English ones. =head1 Motivation The motivation for creating this module was as a form of assistance for translation of documents from Japanese into English, especially documents containing a large number of dates. =head1 See also These other modules might be more suitable for some purposes: =over =item L<DateTime::Locale::JA> This does the minimal stuff to make a Japanese date. One of those modules which has been made for completeness rather than for usefulness, it doesn't represent Japanese language usages very well, failing to contain Japanese eras, kanji numbers, wide numbers, etc. =item L<DateTime::Format::Japanese> This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a L<DateTime> object into a Japanese date, and it also does times. However, the module seems to be broken - it doesn't install on any system I've tried. =item L<Lingua::JA::Numbers> This module has a very full set of kanji / numeral convertors. It converts numbers including decimal points and numbers into the billions and trillions. =item L<DateTime::Calendar::Japanese::Era> This module contains a full set of Japanese eras. =back =head1 COPYRIGHT AND LICENCE Copyright (C) 2008 Ben Kasmin Bullock. This module is distributed under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available. =cut 1; __END__