| Lingua-EL-Poly2Mono documentation | Contained in the Lingua-EL-Poly2Mono distribution. |
Lingua::EL::Poly2Mono - Convert polytonic Greek to monotonic
This document describes version .02 of Lingua::EL::Poly2Mono, released in October of 2006.
use Lingua::EL::Poly2Mono 'poly2mono'; $monotonic_equivalent = poly2mono $polytonic_text; # OR use Lingua::EL::Poly2Mono; $monotonic_equivalent = Lingua::EL::Poly2Mono::poly2mono $polytonic_text;
This module provides one exportable subroutine, poly2mono, which
takes a traditional polytonic Greek string as its sole argument and
converts in to Modern monotonic. The input string can be either a
Unicode string or a sequence of raw Unicode bytes. The return value will
be in the same format.
To make this clearer:
# Unicode string:
$mono = poly2mono "\x{1f21}"; # eta with dasia
# $mono now contains "\x{03b7}" (unaccented eta)
# raw Unicode bytes:
$mono = poly2mono "\xe1\xbc\xa1";
# $mono now contains "\xce\xb7"
This module has only been tested with Perl 5.002_01
and 5.8.6 (in 5.002_01 you need parentheses around the argument or
a
use subs 'poly2mono' statement). It uses the Encode module's
is_utf8 function to distinguish
between the two types of input. If this function (or the Encode module)
is not available, the
input will be treated as bytes.
0.02 (October 2006, this version) Accentuation was corrected for the words ή, για, πιο and πια.
0.01 (April 2006) The first version
Father Chrysostomos <sprout (at]cpan.org>
| Lingua-EL-Poly2Mono documentation | Contained in the Lingua-EL-Poly2Mono distribution. |
package Lingua::EL::Poly2Mono; require Exporter; use strict; # :-( use vars qw/ $VERSION @ISA @EXPORT_OK $OLD_PERL $C $conson $psiliaccent_lc $gramma $diacr $ui $diphpre $accent %remove %p2m %direm /; $VERSION = 0.02; @ISA = 'Exporter'; # @ISNTA = 'Deporter'; @EXPORT_OK = 'poly2mono'; { local $@ ; eval { require Encode; Encode->import(qw/is_utf8 encode_utf8 decode_utf8/) }; $@ and ++$OLD_PERL; } $C = '(?:[\x00-\x7f]|[\xc0-\xff][\x80-\xbf]+)'; $conson = "Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î |Ρ|Σ|Τ|Φ|Χ|Ψ|β|γ|δ|ζ|θ|κ|λ|μ|ν|ξ|Ï|Ï|Ï|Ï|Ï|Ï|Ï|Ï|ῤ|á¿¥|Ῥ"; $psiliaccent_lc="á¼|á¼|á¼|á¼|á¼|á¼¢|ἤ|ἦ|á¼²|á¼´|á¼¶|á½|á½|á½|á½|á½|á½¢|ὤ|ὦ|á¾|á¾|á¾|á¾|á¾|á¾|á¾¢|ᾤ|ᾦ"; $gramma = "(Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î|Î |Ρ|Σ|Τ|Î¥|Φ|Χ|Ψ|Ω|Ϊ|Ϋ|ά|Î|ή|ί|ΰ|α|β|γ|δ|ε|ζ|η|θ|ι|κ|λ|μ|ν|ξ|ο|Ï|Ï|Ï|Ï|Ï|Ï |Ï|Ï|Ï|Ï|Ï|Ï|Ï|Ï|Ï|á¼|á¼|á¼|á¼|á¼|á¼ |á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼ |ἡ|á¼¢|á¼£|ἤ|á¼¥|ἦ|á¼§|Ἠ|Ἡ|Ἢ|Ἣ|Ἤ|á¼|á¼®|Ἧ|á¼°|á¼±|á¼²|á¼³|á¼´|á¼µ|á¼¶|á¼·|Ἰ|á¼¹|Ἲ|á¼»|á¼¼|á¼½|á¼¾|Ἷ|á½|á½|á½|á½|á½|á½ |á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½ |ὡ|á½¢|á½£|ὤ|á½¥|ὦ|á½§|Ὠ|Ὡ|Ὢ|Ὣ|Ὤ|á½|á½®|Ὧ|á½°|á½±|á½²|á½³|á½´|á½µ|á½¶|á½·|ὸ|á½¹|ὺ|á½»|á½¼|á½½|á¾|á¾|á¾|á¾|á¾|á¾ |á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾ |ᾡ|á¾¢|á¾£|ᾤ|á¾¥|ᾦ|á¾§|ᾨ|ᾩ|ᾪ|ᾫ|ᾬ|á¾|á¾®|ᾯ|á¾²|á¾³|á¾´|á¾¶|á¾·|Ὰ|á¾»|á¾¼|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿¢|á¿£|ῤ|á¿¥|ῦ|á¿§|Ὺ|á¿«|Ῥ|ῲ|ῳ|á¿´|á¿¶|á¿·|Ὸ|Ό|Ὼ|á¿»|ῼ)"; $diacr="Ï|Ï|á¼|á¼|á¼|á¼|á¼|á¼ |á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼ |ἡ|á¼¢|á¼£|ἤ|á¼¥|ἦ|á¼§|Ἠ|Ἡ|Ἢ|Ἣ|Ἤ|á¼|á¼®|Ἧ|á¼°|á¼±|á¼²|á¼³|á¼´|á¼µ|á¼¶|á¼·|Ἰ|á¼¹|Ἲ|á¼»|á¼¼|á¼½|á¼¾|Ἷ|á½|á½|á½|á½|á½|á½ |á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½ |ὡ|á½¢|á½£|ὤ|á½¥|ὦ|á½§|Ὠ|Ὡ|Ὢ|Ὣ|Ὤ|á½|á½®|Ὧ|á½°|á½±|á½²|á½³|á½´|á½µ|á½¶|á½·|ὸ|á½¹|ὺ|á½»|á½¼|á½½|á¾|á¾|á¾|á¾|á¾|á¾ |á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾ |ᾡ|á¾¢|á¾£|ᾤ|á¾¥|ᾦ|á¾§|ᾨ|ᾩ|ᾪ|ᾫ|ᾬ|á¾|á¾®|ᾯ|á¾²|á¾³|á¾´|á¾¶|á¾·|Ὰ|á¾»|á¾¼|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿¢|á¿£|ῤ|á¿¥|ῦ|á¿§|Ὺ|á¿«|Ῥ|ῲ|ῳ|á¿´|á¿¶|á¿·|Ὸ|Ό|Ὼ|á¿»|ῼ"; $ui="á¼°|á¼±|á¼²|á¼³|á¼´|á¼µ|á¼¶|á¼·|á½|á½|á½|á½|á½|á½|á½|á½|á½¶|á½·|ὺ|á½»|á¿|ῦ"; $diphpre="Î|Î|Î|Î|Î¥|α|ε|η|ο|Ï "; $accent="á¼|á¼|á¼|á¼ |á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼|á¼¢|á¼£|ἤ|á¼¥|ἦ|á¼§|Ἢ|Ἣ|Ἤ|á¼|á¼®|Ἧ|á¼²|á¼³|á¼´|á¼µ|á¼¶|á¼·|Ἲ|á¼»|á¼¼|á¼½|á¼¾|Ἷ|á½|á½|á½|á½ |á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½|á½¢|á½£|ὤ|á½¥|ὦ|á½§|Ὢ|Ὣ|Ὤ|á½|á½®|Ὧ|á½°|á½±|á½²|á½³|á½´|á½µ|á½¶|á½·|ὸ|á½¹|ὺ|á½»|á½¼|á½½|á¾|á¾|á¾|á¾ |á¾|á¾á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾|á¾¢|á¾£|ᾤ|á¾¥|ᾦ|á¾§|ᾪ|ᾫ|ᾬ|á¾|á¾®|ᾯ|á¾²|á¾´|á¾¶|á¾·|Ὰ|á¾»|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿|á¿¢|á¿£|ῦ|á¿§|Ὺ|á¿«|ῲ|á¿´|á¿¶|á¿·|Ὸ|Ό|Ὼ|á¿»"; # This is for removing koronides with accents, secondary accents at the # end of a word, and diereses preceded by accents. %remove = qw(Ï Î¹ Ï Ï á¼ Î± ἠα ἠα ἠα ἠα ἠα ἠα ἠα ἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠε ἠε ἠε ἠε ἠε ἠε ἠΠἠΠἠΠἠΠἠΠἠΠἠη ἡ η á¼¢ η á¼£ η ἤ η á¼¥ η ἦ η á¼§ η Ἠ ΠἩ ΠἪ ΠἫ ΠἬ Î á¼ Î á¼® ΠἯ Î á¼° ι á¼± ι á¼² ι á¼³ ι á¼´ ι á¼µ ι á¼¶ ι á¼· ι Ἰ Î á¼¹ ΠἺ Î á¼» Î á¼¼ Î á¼½ Î á¼¾ ΠἿ Πὠο ὠο ὠο ὠο ὠο ὠο á½ Î á½ Î á½ Î á½ Î á½ Î á½ Î á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Î¥ ὠΥ ὠΥ ὠΥ á½ Ï á½¡ Ï á½¢ Ï á½£ Ï á½¤ Ï á½¥ Ï á½¦ Ï á½§ Ï á½¨ Ω Ὡ Ω Ὢ Ω Ὣ Ω Ὤ Ω ὠΩ á½® Ω Ὧ Ω á½° α á½± α á½² ε á½³ ε á½´ η á½µ η á½¶ ι á½· ι ὸ ο á½¹ ο ὺ Ï á½» Ï á½¼ Ï á½½ Ï á¾ Î± ᾠα ᾠα ᾠα ᾠα ᾠα ᾠα ᾠα ᾠΠᾠΠᾠΠᾠΠᾠΠᾠΠᾠΠᾠΠᾠη ᾠη ᾠη ᾠη ᾠη ᾠη ᾠη ᾠη á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Ï á¾¡ Ï á¾¢ Ï á¾£ Ï á¾¤ Ï á¾¥ Ï á¾¦ Ï á¾§ Ï á¾¨ Ω ᾩ Ω ᾪ Ω ᾫ Ω ᾬ Ω ᾠΩ á¾® Ω ᾯ Ω á¾° α á¾± α á¾² α á¾³ α á¾´ α á¾¶ α á¾· α Ᾰ Î á¾¹ ΠᾺ Î á¾» Î á¾¼ Πῠη ῠη ῠη ῠη ῠη ῠΠῠΠῠΠῠΠῠΠῠι ῠι ῠι ῠι ῠι ῠι á¿ Î á¿ Î á¿ Î á¿ Î á¿ Ï á¿¡ Ï á¿¢ Ï á¿£ Ï á¿¦ Ï á¿§ Ï á¿¨ Î¥ á¿© Î¥ Ὺ Î¥ á¿« Î¥ ῲ Ï á¿³ Ï á¿´ Ï á¿¶ Ï á¿· Ï á¿¸ ΠΌ ΠῺ Ω á¿» Ω ῼ Ω); %p2m=qw{ἠα ἠα ἠά ἠά ἠά ἠά ἠά ἠά ἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠε ἠε ἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠΠἠη ἡ η á¼¢ ή á¼£ ή ἤ ή á¼¥ ή ἦ ή á¼§ ή Ἠ ΠἩ ΠἪ ΠἫ ΠἬ Î á¼ Î á¼® ΠἯ Î á¼° ι á¼± ι á¼² ί á¼³ ί á¼´ ί á¼µ ί á¼¶ ί á¼· ί Ἰ Î á¼¹ ΠἺ Î á¼» Î á¼¼ Î á¼½ Î á¼¾ ΠἿ Πὠο ὠο á½ Ï á½ Ï á½ Ï á½ Ï á½ Î á½ Î á½ Î á½ Î á½ Î á½ Î á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Ï á½ Î¥ ὠΥ á½ Î á½ Î á½ Ï á½¡ Ï á½¢ Ï á½£ Ï á½¤ Ï á½¥ Ï á½¦ Ï á½§ Ï á½¨ Ω Ὡ Ω Ὢ ΠὫ ΠὬ Î á½ Î á½® ΠὯ Î á½° ά á½± ά á½² Î á½³ Î á½´ ή á½µ ή á½¶ ί á½· ί ὸ Ï á½¹ Ï á½º Ï á½» Ï á½¼ Ï á½½ Ï á¾ Î± ᾠα ᾠά ᾠά ᾠά ᾠά ᾠά ᾠά ᾠΠᾠΠᾠΠᾠΠᾠΠᾠΠᾠΠᾠΠᾠη ᾠη ᾠή ᾠή ᾠή ᾠή ᾠή ᾠή á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Î á¾ Ï á¾¡ Ï á¾¢ Ï á¾£ Ï á¾¤ Ï á¾¥ Ï á¾¦ Ï á¾§ Ï á¾¨ Ω ᾩ Ω ᾪ Πᾫ Πᾬ Î á¾ Î á¾® Πᾯ Î á¾² ά á¾³ α á¾´ ά á¾¶ ά á¾· ά Ὰ Î á¾» Î á¾¼ Πῠή ῠη ῠή ῠή ῠή ῠΠῠΠῠΠῠΠῠΠῠΠῠΠῠί á¿ Î á¿¢ ΰ á¿£ ΰ ῤ Ï á¿¥ Ï á¿¦ Ï á¿§ ΰ Ὺ Î á¿« ΠῬ Ρ ῲ Ï á¿³ Ï á¿´ Ï á¿¶ Ï á¿· Ï á¿¸ ΠΌ ΠῺ Î á¿» Πῼ Ω á¾½ â ᾿ â ´ Í´}; %direm = #dieresis removal qw{Ï Î¹ Ï Ï á¿ Î¯ á¿£ Ï á¿ Î¯ á¿¢ Ï á¿ Î¯ á¿§ Ï}; sub poly2mono { if ($OLD_PERL or ! is_utf8($_[0])) { goto &_poly2mono; } else { decode_utf8(_poly2mono(encode_utf8($_[0]))); # Yes, I know this is inefficient. I might rewrite _poly2mono some day. } } sub _poly2mono { # the guts my($orig) = $_[0]; my($newstring,$thischar); my($fsyl,$fphon,$lsyl,$prevvowel); # first syllable, first phoneme, last syllable, previous vowel my(@lexis); while($orig =~ s/$C//) { $thischar = $&; if ($thischar =~ /^$diacr$/) { # current pos my($cpos) = $thischar =~ /^$ui$/ && @lexis && $lexis[$#lexis] =~ /^$diphpre$/ ? $#lexis-1 : $#lexis; $fphon=$prevvowel=''; $fsyl=$lsyl=1; if ($lexis[$#lexis] !~ /^$gramma$/ or !scalar @lexis or $cpos<$#lexis && 2>scalar @lexis) { $fphon=1; } else{ foreach (reverse 0..$cpos){ if ($lexis[$_] =~ /^$gramma$/ && $lexis[$_] !~ /^$conson$/){ $prevvowel=$lexis[$_]; $fsyl='';last; } elsif ($lexis[$_] !~ /^$gramma$/){ last; } } } my($nnn)=0; my($lll); for(;$orig =~ /$C {$nnn}($C)/x;++$nnn){ $lll = $1; if($1 =~ /^$gramma$/ && $lll !~ /^$conson$/){ $lsyl='';last; }elsif($lll !~ /^$gramma$/){ last; } } #print "$thischar ", $fphon && "fphon ", $fsyl && "fsyl ", $lsyl && 'lsyl ', "prevvowel: $prevvowel<br>"; if ($thischar =~ /^$psiliaccent_lc$/ && !$fphon && (!$fsyl or !$lsyl)) { $newstring .=($remove{$thischar} || $thischar) . ' Î'; # Accentuation exceptions are dealt with here: }elsif ($thischar eq 'ῦ' and join('',@lexis) =~ /^(?:Î |Ï)ο$/ and $orig !~ /^$gramma/){ $newstring .= 'Ï'; } elsif ($thischar eq 'á¿¶' and join('',@lexis) =~ /Î |Ï$/ and $orig =~ /^Ï(?!$gramma)/) { $newstring .= 'Ï'; } elsif ($thischar =~ /^(?:á¼¢|ἤ)/ and !@lexis and $orig =~ /^(?!$gramma)/) { $newstring .= 'ή'; } elsif ($thischar =~ /^(?:á½°|á½±)/ and join('',@lexis) =~ /(?:Î|γ|Î |Ï)ι$/ and $orig =~ /^(?!$gramma)/) { $newstring .= 'α'; } elsif ($thischar =~ /^(?:ὸ|á½¹)/ and join('',@lexis) =~ /(?:Î |Ï)ι$/ and $orig =~ /^(?!$gramma)/) { $newstring .= 'ο'; } elsif (($fsyl and $lsyl) or ($prevvowel =~ /$accent/)){ $newstring .= $remove{$thischar} || $thischar; }elsif ($thischar =~ /${\join '|', keys %direm}/ && $lexis[$#lexis] !~ /^$diphpre$/ or $thischar =~ /Ï|á¿|á¿|á¿|Ϊ/ && $lexis[$#lexis] !~ /Î|Î|Î|Î¥|α|ε|ο|Ï / or $thischar =~ /Ï|á¿£|á¿¢|á¿§|Ϋ/ && $lexis[$#lexis] !~ /Î|Î|Î|Î|α|ε|η|ο/){ $newstring .= $direm{$thischar}; } else { $newstring .= $p2m{$thischar}||$thischar } } else {$newstring .= $p2m{$thischar} || $thischar} if ($thischar =~ /^$gramma$/) { push @lexis, $thischar; } else { @lexis = ();} } return $newstring; } 1; __END__ I was going to put this in the man page, but I decided against it: # raw utf8 bytes: $mono = poly2mono "\xce\xa4\xce\xbf\xe1\xbd\x90\xce\xbb\xe1\xbd\xb1\xcf\x87\xce" . "\xb9\xcf\x83\xcf\x84\xce\xbf\xce\xbd \xce\xb8\xe1\xbd\xb3\xce" . "\xbb\xcf\x89 \xce\xbd\xe1\xbc\x84\xcf\x83\xcf\x84\xce\xb1\xce" . "\xb9 \xce\xba\xce\xb1\xce\xbb\xe1\xbd\xb1!"; # $mono now contains # "\xce\xa4\xce\xbf\xcf\x85\xce\xbb\xce\xac\xcf\x87\xce\xb9\xcf" # . "\x83\xcf\x84\xce\xbf\xce\xbd \xce\xb8\xce\xad\xce\xbb\xcf\x89" # . " \xce\xbd\xce\xb1 \xce\x84\xcf\x83\xcf\x84\xce\xb1\xce\xb9 " # . "\xce\xba\xce\xb1\xce\xbb\xce\xac!" # OR # Unicode string: $mono = poly2mono "\x{03a4}\x{03bf}\x{1f50}\x{03bb}\x{1f71}\x{03c7}\x{03b9}" . "\x{03c3}\x{03c4}\x{03bf}\x{03bd} \x{03b8}\x{1f73}\x{03bb}" . "\x{03c9} \x{03bd}\x{1f04}\x{03c3}\x{03c4}\x{03b1}\x{03b9} " . "\x{03ba}\x{03b1}\x{03bb}\x{1f71}!" # $mono now contains # "\x{03a4}\x{03bf}\x{03c5}\x{03bb}\x{03ac}\x{03c7}\x{03b9}" # . "\x{03c3}\x{03c4}\x{03bf}\x{03bd} \x{03b8}\x{03ad}\x{03bb}" # . "\x{03c9} \x{03bd}\x{03b1} \x{0384}\x{03c3}\x{03c4}\x{03b1}" # . "\x{03b9} \x{03ba}\x{03b1}\x{03bb}\x{03ac}!" =encoding utf-8 (no POD converter seems to support this, even though the perlpod man page has it listed)