Unicode::Property::XS - Unicode properties implemented by lookup table in C code.


Unicode-Property-XS documentation Contained in the Unicode-Property-XS distribution.

NAME

Unicode::Property::XS - Unicode properties implemented by lookup table in C code.

SYNOPSIS

  use Unicode::Property::XS qw(:all); # 'ucs_' is the default prefix

  my @property_letters;
  foreach my $ord (0x0000..0x37FF) { 
      push @property_letters, ucs_L($ord);    # /\p{L}/ 
  };
  my @property_list = ucs_EaFullwidth1(0x0000..0x37FF);

  foreach my $ord (0x0000..0x3FFFF) {
      next if !ucs_Legal($ord);
      die "Internal error!" if ucs_M($ord) != ((chr($ord) =~ /\p{M}/) ? 1 : 0);
  }

  my @myChars = qw( a b c d e f g 1 2 3 );
  my @property_list2 = ucs_L( map { ord($_) } @myChars );

  __END__

  #################################

  BEGIN { $Unicode::Property::XS::Prefix = 'Is'; }
  use Unicode::Property::XS;

  my @property_letters;
  foreach my $ord (0x0000..0x37FF) { 
      push @property_letters, IsL($ord);    # /\p{L}/ 
  };

  __END__

   #################################

   use Unicode::Property::XS qw( Legal :EastAsianWidth );
   use Unicode::EastAsianWidth;
   BEGIN { $Unicode::EastAsianWidth::EastAsian = 0; };

   foreach my $ord (0x0000..0xEFFFF) {
       next if !ucs_Legal($ord) ; 
       my $lookup_value = ucs_EaFullwidth0($ord);    # /\p{InFullwidth}/
       my $re_value = chr($ord)=~/\p{InFullwidth}/ ;
       die "Error in Unicode::Property::XS!\n" if !($lookup_value == $re_value) ;
   };

   __END__
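
The prefix mechanism used in the second example can be illustrated with a toy package. This is only a sketch: My::Toy and its pure-Perl L() are hypothetical stand-ins, not part of Unicode::Property::XS, and the real functions are backed by a C lookup table. It shows why the prefix has to be set in a BEGIN block before import() runs.

```perl
use strict;
use warnings;

# Toy stand-in for the real module; My::Toy and its L() are hypothetical.
package My::Toy;

our $Prefix;

# Pure-Perl placeholder for a property test (the real module uses a C table).
sub L { my ($ord) = @_; return chr($ord) =~ /\p{L}/ ? 1 : 0 }

sub import {
    my ($pkg, @names) = @_;
    my $caller = caller;
    my $prefix = defined $Prefix ? $Prefix : 'ucs_';
    no strict 'refs';
    for my $name (@names) {
        # Install My::Toy::L as e.g. main::IsL in the caller's namespace.
        *{ "${caller}::${prefix}${name}" } = \&{ "${pkg}::${name}" };
    }
}

package main;

# The prefix must be set before import() runs, hence the BEGIN block.
BEGIN { $My::Toy::Prefix = 'Is'; My::Toy->import('L'); }

print join('', IsL(ord('a')), IsL(ord('1'))), "\n";   # prints "10"
```

Because the name is built at import time, changing the prefix after the use statement has no effect on functions that were already installed.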

DESCRIPTION

Unicode properties in Perl regular expressions are handy, but matching them is somewhat slow when each property test is repeated only a few times for a given word. So I made a table-lookup XS module for property lookup. It implements the properties in the "Unicode Character Properties" section of perlunicode and those in Unicode::EastAsianWidth.

The bundle costs 1.2MB for the run-time dynamic library and includes all the property classes listed below. Please tell me if you have module-splitting or space-saving solutions.
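
The table-lookup idea can be sketched in pure Perl with a bit vector. This is an illustration of the general technique only, not the module's C implementation: the helper name table_L, the variable $LIMIT, and the restriction to code points below 0x3800 are assumptions made to keep the example small.

```perl
use strict;
use warnings;

# Assumed table size for this sketch: code points 0x0000..0x37FF only.
my $LIMIT = 0x3800;

# Build the table once: one bit per code point, set when \p{L} matches.
my $table_L = '';
for my $ord (0 .. $LIMIT - 1) {
    vec($table_L, $ord, 1) = chr($ord) =~ /\p{L}/ ? 1 : 0;
}

# Hypothetical lookup helper: a bounds check plus a single bit fetch.
sub table_L {
    my ($ord) = @_;
    return 0 if $ord < 0 || $ord >= $LIMIT;
    return vec($table_L, $ord, 1);
}

printf "%d %d\n", table_L(ord('a')), table_L(ord('1'));   # prints "1 0"
```

The cost of the regex is paid once while the table is built; afterwards every query is a constant-time fetch. One table per property class is also what makes a full set of classes take a noticeable amount of space.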

All the functions except ucs_Legal() work the same way. Each takes a character's numeric code point and returns 1 if the character is in that property class and 0 if it is not. It also returns 0 if the encoding value is illegal (which should not happen if the input value comes from ord($ucs_char)). It returns 15 if the value is in plane 15 and 16 if it is in plane 16; both are user-defined (private use) planes.

ucs_Legal() returns 1 if perl will not complain about chr($ucs_ord), and 0 otherwise.
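
Since 15 and 16 are true values in boolean context, a caller that only wants class membership should compare the result with 1. A minimal sketch of interpreting the return values follows; describe_result is a hypothetical helper, not a function of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: map a raw return value to a label,
# following the convention documented above.
sub describe_result {
    my ($ret) = @_;
    return 'in class'          if $ret == 1;
    return 'private-use plane' if $ret == 15 || $ret == 16;
    return 'not in class';     # 0: non-member or illegal code point
}

# 15 and 16 are true in boolean context, so test "== 1" for membership.
for my $ret (0, 1, 15, 16) {
    printf "%2d => %s\n", $ret, describe_result($ret);
}
```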

The following functions can be exported to the caller's scope: ucs_Legal() and the property functions below.

Functions for general properties: ucs_L(), ucs_LC(), ucs_Lu(), ucs_Ll(), ucs_Lt(), ucs_Lm(), ucs_Lo(), ucs_M(), ucs_Mn(), ucs_Mc(), ucs_Me(), ucs_N(), ucs_Nd(), ucs_Nl(), ucs_No(), ucs_P(), ucs_Pc(), ucs_Pd(), ucs_Ps(), ucs_Pe(), ucs_Pi(), ucs_Pf(), ucs_Po(), ucs_S(), ucs_Sm(), ucs_Sc(), ucs_Sk(), ucs_So(), ucs_Z(), ucs_Zs(), ucs_Zl(), ucs_Zp(), ucs_C(), ucs_Cc(), ucs_Cf(), ucs_Cs(), ucs_Co(), ucs_Cn().

Functions for bidirectional properties: ucs_BidiL(), ucs_BidiLRE(), ucs_BidiLRO(), ucs_BidiR(), ucs_BidiAL(), ucs_BidiRLE(), ucs_BidiRLO(), ucs_BidiPDF(), ucs_BidiEN(), ucs_BidiES(), ucs_BidiET(), ucs_BidiAN(), ucs_BidiCS(), ucs_BidiNSM(), ucs_BidiBN(), ucs_BidiB(), ucs_BidiS(), ucs_BidiWS(), ucs_BidiON().

Functions for scripts (properties PhagsPa, Phoenician are not included since they are not implemented in /\p{ }/ form): ucs_Arabic(), ucs_Armenian(), ucs_Balinese(), ucs_Bengali(), ucs_Bopomofo(), ucs_Braille(), ucs_Buginese(), ucs_Buhid(), ucs_CanadianAboriginal(), ucs_Cherokee(), ucs_Coptic(), ucs_Cuneiform(), ucs_Cypriot(), ucs_Cyrillic(), ucs_Deseret(), ucs_Devanagari(), ucs_Ethiopic(), ucs_Georgian(), ucs_Glagolitic(), ucs_Gothic(), ucs_Greek(), ucs_Gujarati(), ucs_Gurmukhi(), ucs_Han(), ucs_Hangul(), ucs_Hanunoo(), ucs_Hebrew(), ucs_Hiragana(), ucs_Inherited(), ucs_Kannada(), ucs_Katakana(), ucs_Kharoshthi(), ucs_Khmer(), ucs_Lao(), ucs_Latin(), ucs_Limbu(), ucs_LinearB(), ucs_Malayalam(), ucs_Mongolian(), ucs_Myanmar(), ucs_NewTaiLue(), ucs_Nko(), ucs_Ogham(), ucs_OldItalic(), ucs_OldPersian(), ucs_Oriya(), ucs_Osmanya(), ucs_PhagsPa(), ucs_Phoenician(), ucs_Runic(), ucs_Shavian(), ucs_Sinhala(), ucs_SylotiNagri(), ucs_Syriac(), ucs_Tagalog(), ucs_Tagbanwa(), ucs_TaiLe(), ucs_Tamil(), ucs_Telugu(), ucs_Thaana(), ucs_Thai(), ucs_Tibetan(), ucs_Tifinagh(), ucs_Ugaritic(), ucs_Yi().

Functions for extended properties: ucs_ASCIIHexDigit(), ucs_BidiControl(), ucs_Dash(), ucs_Deprecated(), ucs_Diacritic(), ucs_Extender(), ucs_HexDigit(), ucs_Hyphen(), ucs_Ideographic(), ucs_IDSBinaryOperator(), ucs_IDSTrinaryOperator(), ucs_JoinControl(), ucs_LogicalOrderException(), ucs_NoncharacterCodePoint(), ucs_OtherAlphabetic(), ucs_OtherDefaultIgnorableCodePoint(), ucs_OtherGraphemeExtend(), ucs_OtherIDStart(), ucs_OtherIDContinue(), ucs_OtherLowercase(), ucs_OtherMath(), ucs_OtherUppercase(), ucs_PatternSyntax(), ucs_PatternWhiteSpace(), ucs_QuotationMark(), ucs_Radical(), ucs_SoftDotted(), ucs_STerm(), ucs_TerminalPunctuation(), ucs_UnifiedIdeograph(), ucs_VariationSelector(), ucs_WhiteSpace().

Functions for derived properties: ucs_Alphabetic(), ucs_Lowercase(), ucs_Uppercase(), ucs_Math(), ucs_IDStart(), ucs_IDContinue(), ucs_Any(), ucs_Assigned(), ucs_Unassigned(), ucs_ASCII(), ucs_Common().

Functions for EastAsianWidth: ucs_EaF(), ucs_EaH(), ucs_EaA(), ucs_EaNa(), ucs_EaW(), ucs_EaN(), ucs_EaFullwidth0(), ucs_EaHalfwidth0(), ucs_EaFullwidth1(), ucs_EaHalfwidth1().

ucs_EaFullwidth0() and ucs_EaHalfwidth0() represent the InFullwidth and InHalfwidth classes as they are defined with $Unicode::EastAsianWidth::EastAsian = 0; ucs_EaFullwidth1() and ucs_EaHalfwidth1() represent them with $Unicode::EastAsianWidth::EastAsian = 1. The two variants differ in how the InEastAsianAmbiguous category is assigned to InFullwidth or InHalfwidth. The actual run-time value of $Unicode::EastAsianWidth::EastAsian is irrelevant to these functions, since the lookup tables are premade.
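
The relation between the six basic width classes and the four derived ones can be sketched as below. This follows my reading of the Unicode::EastAsianWidth documentation (the ambiguous class joins InFullwidth when $EastAsian is true and InHalfwidth when it is false) and should be treated as an assumption rather than a statement about this module's tables. The helper derive_widths and its hash keys are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given 0/1 flags for the six basic classes
# (F, H, A, Na, W, N), derive the four combined classes.
sub derive_widths {
    my (%f) = @_;
    my $wide   = ($f{F} || $f{W})           ? 1 : 0;
    my $narrow = ($f{H} || $f{Na} || $f{N}) ? 1 : 0;
    my $ambig  = $f{A}                      ? 1 : 0;
    return (
        Fullwidth0 => $wide,                        # EastAsian = 0
        Halfwidth0 => ($narrow || $ambig) ? 1 : 0,  # ambiguous counts as half
        Fullwidth1 => ($wide || $ambig) ? 1 : 0,    # EastAsian = 1
        Halfwidth1 => $narrow,                      # ambiguous counts as full
    );
}

# An ambiguous-width character changes sides between the two variants.
my %w = derive_widths(A => 1);
print join(' ', map { "$_=$w{$_}" } sort keys %w), "\n";
# prints "Fullwidth0=0 Fullwidth1=1 Halfwidth0=1 Halfwidth1=0"
```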

In my line-wrapping program, using this module cut the total running time in half compared with the original regex version, i.e. the /\p{ }/ family. Caching the regex results did not help much. With the Benchmark module, however, it shows only a 20%-50% performance difference.

EXPORT

Nothing is exported by default. Request individual functions by name, or use one of the tags :all, :general, :bidirectional, :scripts, :extended, :derived, or :EastAsianWidth.

SEE ALSO

perlunicode, Unicode::EastAsianWidth, http://www.unicode.org/unicode/reports/tr11/, http://unicode.org/Public/UNIDATA/EastAsianWidth.txt

AUTHOR

Mindos Cheng, <mindos@gmail.com>

COPYRIGHT AND LICENSE


package Unicode::Property::XS;

use 5.008;
use strict;
use warnings;
use vars qw( $VERSION );

#require Exporter;
#our @ISA = qw(Exporter);
# use Exporter::Lite;
#our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS);
our $Prefix;
BEGIN {
    $VERSION = '0.81';
}

# This allows declaration   use Unicode::Property::XS ':all';


our @general = (
         'L', 'LC', 'Lu', 'Ll',
         'Lt', 'Lm', 'Lo', 'M',
         'Mn', 'Mc', 'Me', 'N',
         'Nd', 'Nl', 'No', 'P',
         'Pc', 'Pd', 'Ps', 'Pe',
         'Pi', 'Pf', 'Po', 'S',
         'Sm', 'Sc', 'Sk', 'So',
         'Z', 'Zs', 'Zl', 'Zp',
         'C', 'Cc', 'Cf', 'Cs',
         'Co', 'Cn' ); 
our @bidirectional = (
         'BidiL', 'BidiLRE', 'BidiLRO', 'BidiR',
         'BidiAL', 'BidiRLE', 'BidiRLO', 'BidiPDF',
         'BidiEN', 'BidiES', 'BidiET', 'BidiAN',
         'BidiCS', 'BidiNSM', 'BidiBN', 'BidiB',
         'BidiS', 'BidiWS', 'BidiON' ); 
our @scripts = (
         'Arabic', 'Armenian', 'Balinese', 'Bengali',
         'Bopomofo', 'Braille', 'Buginese', 'Buhid',
         'CanadianAboriginal', 'Cherokee', 'Coptic', 'Cuneiform',
         'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari',
         'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic',
         'Greek', 'Gujarati', 'Gurmukhi', 'Han',
         'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana',
         'Inherited', 'Kannada', 'Katakana', 'Kharoshthi',
         'Khmer', 'Lao', 'Latin', 'Limbu',
         'LinearB', 'Malayalam', 'Mongolian', 'Myanmar',
         'NewTaiLue', 'Nko', 'Ogham', 'OldItalic',
         'OldPersian', 'Oriya', 'Osmanya', 'PhagsPa',
         'Phoenician', 'Runic', 'Shavian', 'Sinhala',
         'SylotiNagri', 'Syriac', 'Tagalog', 'Tagbanwa',
         'TaiLe', 'Tamil', 'Telugu', 'Thaana',
         'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic',
         'Yi' ); 
our @extended = (
         'ASCIIHexDigit', 'BidiControl', 'Dash', 'Deprecated',
         'Diacritic', 'Extender', 'HexDigit', 'Hyphen',
         'Ideographic', 'IDSBinaryOperator', 'IDSTrinaryOperator', 'JoinControl',
         'LogicalOrderException', 'NoncharacterCodePoint', 'OtherAlphabetic', 'OtherDefaultIgnorableCodePoint',
         'OtherGraphemeExtend', 'OtherIDStart', 'OtherIDContinue', 'OtherLowercase',
         'OtherMath', 'OtherUppercase', 'PatternSyntax', 'PatternWhiteSpace',
         'QuotationMark', 'Radical', 'SoftDotted', 'STerm',
         'TerminalPunctuation', 'UnifiedIdeograph', 'VariationSelector', 'WhiteSpace' ); 
our @derived = (
         'Alphabetic', 'Lowercase', 'Uppercase', 'Math',
         'IDStart', 'IDContinue', 'Any', 'Assigned',
         'Unassigned', 'ASCII', 'Common' ); 
our @EastAsianWidth = (
         'EaF', 'EaH', 'EaA', 'EaNa',
         'EaW', 'EaN', 'EaFullwidth0', 'EaFullwidth1',
         'EaHalfwidth0', 'EaHalfwidth1' ); 

our %EXPORT_TAGS = ( 
        'all' => [ 'Legal',
        @general,@bidirectional,@scripts,@extended,
        @derived,@EastAsianWidth ],

        'general' => [ @general ],
        'bidirectional' => [ @bidirectional ],
        'scripts' => [ @scripts ],
        'extended' => [ @extended ],
        'derived' => [ @derived ],
        'EastAsianWidth' => [ @EastAsianWidth ],
        );


#    ucs_InEastAsianFullwidth
#    ucs_InEastAsianHalfwidth
#    ucs_InEastAsianAmbiguous
#    ucs_InEastAsianNarrow
#    ucs_InEastAsianWide
#    ucs_InEastAsianNeutral
#    ucs_InFullwidth
#    ucs_InHalfwidth
#      for ucs_InFullwidth see context $Unicode::EastAsianWidth::EastAsian
#      for ucs_InHalfwidth see context $Unicode::EastAsianWidth::EastAsian

our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );

our @EXPORT = qw();

sub import {
    my ($pkg, @imports) = @_;
    my($caller, $file, $line) = caller;

    $Prefix = defined($Prefix) ? $Prefix : 'ucs_' ;
    $Prefix =~ s/[^A-Za-z0-9_]//g ;                    # strip possible weird chars in prefix 

    # my @tags;
    my @items;

    # adapted from Exporter::Lite
    if ( !@imports ) {        # Default import.
        @imports = @EXPORT;
    }
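    # %ok maps every exportable name to a state:
    # 1 = known but not requested, 2 = requested for import.
    my %ok = map { s/^&//; $_ => 1; } @EXPORT_OK, @EXPORT;
    my %ok_tag = map { $_ => 1; } keys %EXPORT_TAGS;
    my $add;

    ITEM:
    foreach my $item (@imports) {
        # A leading '!' deselects the item (state 1); otherwise select it (state 2).
        $add = $item =~ s/^!// ? 1 : 2;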

        if ($item eq ':DEFAULT') {
            map { $ok{$_}=$add; } @EXPORT ;
            next ITEM;
        }

        if ($item =~ /^:(.*)/) {
            if (!$ok_tag{$1}) {
                _report_error($1);
                next ITEM;
            };
            map { $ok{$_}=$add; } @{ $EXPORT_TAGS{$1} } ;
        }
        else {
            if (!$ok{$item}) {
                _report_error($item);
                next ITEM;
            };
            $ok{$item}=$add;
        }
    };

    foreach my $item (keys %ok) {
        next if $ok{ $item } != 2 ;

        do { no strict;
            *{ $caller.'::'.$Prefix.$item } = \&{ $item };
        };
    };

};

sub _report_error {
    my $item = shift;
    do { require Carp; Carp::croak("Can't export symbol: $item") };
};

require XSLoader;
XSLoader::load('Unicode::Property::XS', $VERSION);

# Preloaded methods go here.

1;
__END__