Unicode::Decompose - Unicode decomposition and normalization


Unicode-Decompose documentation Contained in the Unicode-Decompose distribution.

Index


Code Index:

NAME

Top

Unicode::Decompose - Unicode decomposition and normalization

SYNOPSIS

Top

  use Unicode::Decompose qw(normalize order decompose recompose normalize_d);

  $norm   = normalize($string);

  # OR:

  $decomp  = decompose($string); 
  $ordered = order($decomp);
  $norm    = recompose($ordered);

DESCRIPTION

Top

This module implements Unicode normalization forms D and C.

These are important for comparing Unicode strings: consider the two strings "\N{LATIN SMALL LETTER E WITH ACUTE}", and "\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}". From one point of view, (simply looking at the characters and the bytes in the string) they're differnet; from another, (looking at the meaning of the characters) they're the same.

Normalization is the process described in Unicode Technical Report #15 by which these two strings are made equal. There are two modes of doing this that we particularly care about: Unicode Normalization Form D is the "weaker" form, and C the stronger form.

Both have two stages in common: In the first stage, the data is "pulled apart", or decomposed. That is, precomposed characters such as "LATIN SMALL LETTER E WITH ACUTE", are split into a main character and the combining characters that follow it. In the next stage, the combining characters are ordered according to a list of priorities defined in the Unicode Character Database. This will make our two example strings both "LATIN SMALL LETTER E, COMBINING ACUTE ACCENT", and will hence compare equal.

Unicode Normalization Form C then takes the resulting string and "pushes together" the data, recomposing it; that is, characters may be returned to precomposed forms - because the combining characters have been rearranged, this might not be the same as the original precomposed characters.

Support for compatiblity decomposition, which is considerably more relaxed about how characters decompose, is implemented at a very rough level but not made available to the end-user at this time. If you want it, you should be able to figure it out from the code.

BUGS

Creating the initial data structures is slow. Maybe move to Storable; the "cooked" data could be installed with the module.

EXPORT

See list in synopsis.

AUTHOR

Top

Simon Cozens, simon@cpan.org

SEE ALSO

Top

perl.


Unicode-Decompose documentation Contained in the Unicode-Decompose distribution.

package Unicode::Decompose;

use lib "../../../lib";
use 5.006;
use strict;
use warnings;
use utf8;
use UnicodeCD qw(charinfo compexcl);

require Exporter;
require DynaLoader;

our @ISA = qw(Exporter DynaLoader);

our %EXPORT_TAGS = ( 'all' => [ qw(
	normalize normalize_d order decompose recompose
) ] );

our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );

our @EXPORT = qw();
our $VERSION = '0.02';

our %compat;
our %canon;

# First, load up the decomposition table
my @decomp = split /\n/, require "unicode/Decomposition.pl" or die $@;
my %decomp;
while ($_=pop @decomp) {
    chomp;
    my ($char,@line) = split;
    $decomp{$char} = [
        map { exists $decomp{$_} ? @{$decomp{$_}} : $_ } @line
    ];
}

# Then remove the recursion from it
my $changed;
do {
    $changed = 0;
    while (my($k,$v) = each %decomp) {
        $decomp{$k} = [ map { exists $decomp{$_} ? do { $changed++; @{$decomp{$_}} } : $_ } @$v ];
    }
} while $changed;

# Now split it into canon and compat
while (my($k,$v) = each %decomp) {
    $compat{pack("U*", hex $k)} = pack "U*", map { hex $_} grep !/</, @$v;
    next if "@$v" =~ /</;
    $canon{pack("U*", hex $k)} = pack "U*", map { hex $_} @$v;
}

my @ok_keys;
# Recomposition
for (keys %canon) {
    my $x = $_;
    next if (compexcl ord $x); # compexcl eats $_, bluh.
    push @ok_keys, $x;
}

my %recomp = map { $canon{$_} => $_ } @ok_keys;
my $recomppat = join "|", reverse sort keys %recomp;

bootstrap Unicode::Decompose $VERSION;

sub normalize {
    my $a = shift;
    _decompose($a, "canon");
    return recompose(order($a));
}
sub normalize_d {
    my $a = shift;
    _decompose($a, "canon");
    return order($a);
}

sub decompose {
    my $a = shift;
    decompose($a, "canon");
    return $a;
}

sub order {
    my $str = shift;
    my @chars = ($str =~ /(\X)*/);
    my $result;
    for (@chars) {
        # my ($head, @others) = /^(\PM)(\pM)*$/; Doesn't work, damn!
        s/^(\PM)// or die ;
        my $head = $1;
        my @others;
        push @others, $1 while s/^(\pM)//;
        $result .= $head;
        next unless @others;
        $result .= join '', sort { charinfo(ord $a)->{combining} <=> charinfo(ord($b))->{combining} } @others;
    }
    return $result;
}

sub recompose {
    my $a = shift;
    $a =~ s/($recomppat)/$recomp{$1}/g;
    return $a;
}

1;
__END__
# Below is stub documentation for your module. You better edit it!