Text::Bidi - Unicode bidi algorithm using libfribidi


Text-Bidi documentation Contained in the Text-Bidi distribution.

Index


Code Index:

NAME

Top

Text::Bidi - Unicode bidi algorithm using libfribidi

SYNOPSIS

Top

    use Text::Bidi;
    $visual = log2vis($logical);

    ($visual, $paradir, $l2v, $v2l, $embedding) =
        log2vis($logical, $paradir);




EXPORT

Top

The following functions can be exported:

All of them can be exported together using the :all tag.

Description

Top

This module provides basic support for the Unicode bidirectional text (Bidi) algorithm, for displaying text consisting of both left-to-right and right-to-left written languages (like Hebrew and Arabic.) It does so using a swig interface file to the libfribidi library.

Though several libfribidi functions are provided by the swig interface file, the standard usage of this module is provided by one function, log2vis(), that translates a logical string into a visual one. In addition, there are several utility functions, and some functions that implement part of the algorithm (see "Comparison with libfribidi and FriBidi.pm" for the reason this is needed.)

The object oriented approach

All functions here can be called using either a procedural or an object oriented approach. For example, you may do either

        $visual = log2vis($logical);

or

        $bidi = new Text::Bidi;
        $visual = $bidi->log2vis($logical);

The advantages of the second form is that it is easier to move to a sub-class, and that two or more objects with different parameters can be used simultaneously.

If you do sub-class this class, and want the procedural interface to use your functions, put a line like

        $Text::Bidi::GlobalClass = __PACKAGE__;

in your module.

Types and Namespaces

Top

The following constants are imported from the fribidi library:

In addition, the hash %Mirrored maps mirrored characters to their counter parts, and the scalar $Mirrored is a pattern that matches one mirrored character.

Functions

Top

The following functions are of interest to the user

new()

Create a new instance of a bidi translator. The following key-value parameters are allowed:

width

The width, in characters, of the displayed string. This affects the reordering algorithm. The default is undef, which will assume that no line-breaking happens.

reset

A string of the characters that function as field (segment) separators. The default is "\x{2029}\x{09}\x{11}", which is (to my understanding) the unicode specification.

clean

If true, log2vis() will remove any explicit bidi marks in the visual string, and adjust the mapping arrays accordingly. Default is true.

These parameters can be accessed using get_width(), set_width() and similar functions.

get_width()

set_width()

get_reset()

set_reset()

get_clean()

set_clean()

Query or set the values of the corresponding parameters. See new() for details about the parameters.

log2vis()

This function provides the main functionality. It can be called, similarly to fribidi_log2vis of the fribidi library, as follows:

        ($vis, $dir, $l2v, $v2l, $levels) =
            log2vis($log[, $dir[, $width]])

The arguments are:

$log

The logical string

$dir

Override the base direction of the paragraph. The possible values are $Text::Bidi::Type::RTL, $Text::Bidi::Type::LTR or $Text::Bidi::Type::ON. The default, if not given, is $Text::Bidi::Type::ON, which means that the direction should be determined according to the bidi algorithm.

$width

The width at which the string is broken. This overrides, and has the same meaning, as the width parameter set by set_width(). As with that parameter, a value of undef means that no line breaking should be done.

The outputs are as follows:

$vis

The visual string. In scalar context, this is the only parameter returned (and in this case the function may work slightly faster.)

$l2v

An arrayref representing the map from the logical string to the visual one, so the $i-th character of the logical string will be in position $l2v-[$i]> of the visual one.

$v2l

The inverse function, mapping characters in the visual string to the logical one.

$levels

The embedding levels - an arrayref assigning to each character of the logical string the nesting level of text of different directions to which it belongs. Pure left-to-right text has embedding level 0. A character is left-to-right (within this string) iff it has even embedding level.

Functions implementing parts of the algorithm

The following functions, that implement parts of the algorithm, are used by log2vis()

levels2intervals()

This function accepts an arrayref of embedding levels and returns an arrayref that, at place $i, contains a hash of intervals (to the index of the start of the interval it assigns the index of the end of it), such that each of them is a maximal interval of embedding levels at least $i. For example, to the embedding levels:

        0011122111333220001

we get

        [
        { 0 => 18 },
        { 2 => 14, 18 => 18 },
        { 5 => 6, 10 => 14 },
        { 10 => 12 }
        ]

reorder()

This function implements the reordering part of the bidi algorithm (section 3.4, L1-L4.) The input is the logical string, the (arrayref of) embedding levels, the base dir of the paragraph, a position in the logical string, and a length. The default for the position is 0, and for the length till the end of the string. The function will return the v2l mapping, the modified embedding levels, the intervals for these levels (as computed by levels2intervals()) and the visual string, all for the part of the string given by the position and the length, and assuming that the string is broken after this segment. In scalar context, only the visual string is returned.

invert()

Compute the inverse of a function given by an array. This is used to convert the $v2l mapping to $l2v.

Utility functions

The following functions are available mainly for testing. See also Text::Bidi::CapRTL for a possibly simpler interface.

caprtl_to_unicode()

Convert a string where right-to-left text is represented by capital letters, and bidi marks by control sequences, to a string with actual right-to-left characters and bidi marks. The control sequences are of the form _C, where C is a character. Run

        fribidi --charsetdesc CapRTL

for a description of the translation table.

unicode_to_caprtl()

Perform the inverse of caprtl_to_unicode()

Comparison with libfribidi and FriBidi.pm

Top

The module has mostly the same interface as FriBidi, the module written originally with the fribidi library. The main differences are:

BUGS

Top

The caprtl_to_unicode() and unicode_to_caprtl() functions currently do not work, because of what appears to be a bug in libfribidi. The details are in https://bugs.freedesktop.org/show_bug.cgi?id=8040.

SEE ALSO

Top

Text::Bidi::CapRTL, Encode

The fribidi library: http://fribidi.org/, http://imagic.weizmann.ac.il/~dov/freesw/FriBidi/

Swig: http://www.swig.org

The unicode bidi algorithm: http://www.unicode.org/unicode/reports/tr9/

AUTHOR

Top

Moshe Kamensky, mailto:kamensky@cpan.org

COPYRIGHT & LICENSE

Top


Text-Bidi documentation Contained in the Text-Bidi distribution.
package Text::Bidi;

use warnings;
use strict 'vars';
use Exporter;
use base qw(Exporter);
use Carp;

our $VERSION = '0.03';


use Encode;
use Text::Bidi::private;
use Text::Bidi::mirrored;

BEGIN {
    our %EXPORT_TAGS = (
        'all' => [ qw(
            log2vis
            caprtl_to_unicode
            unicode_to_caprtl
            set_width
            get_width
            set_reset
            get_reset
            set_clean
            get_clean
        ) ],
    );
    our @EXPORT_OK = ( @{$EXPORT_TAGS{'all'}} );
    our @EXPORT = qw(); ## no critic
}

# The following mechanism is used to provide both kinds of interface: Every 
# method starts with 'my $self = S(@_)' instead of 'my $self = shift'. S 
# shifts and returns the object if there is one, or returns a global object, 
# stored in $Global, if there is in @_. The first time $Global is needed, it 
# is created with type $GlobalClass.

my $Global;
our $GlobalClass = __PACKAGE__;

sub S(\@) {
    my $l = shift;
    my $s = $l->[0];
    return shift @$l if eval { $s->isa('Text::Bidi') };
    $Global = new $GlobalClass unless $Global;
    $Global
}

foreach ( keys %Text::Bidi::private:: ) {
    *{"Text::Bidi::Type::$1"} = *{"Text::Bidi::private::$_"} 
        if /^FRIBIDI_TYPE_([A-Z]*)$/;
    *{"Text::Bidi::Mask::$1"} = *{"Text::Bidi::private::$_"} 
        if /^FRIBIDI_MASK_([A-Z]*)$/;
    no warnings 'once'; ## no critic
    ${"Text::Bidi::Unicode::$1"} = chr(${"Text::Bidi::private::$_"})
        if /^UNI_([A-Z]*)$/;
}

sub new {
    my $class = shift;
    my $self = {
        'reset' => "\x{2029}\x{09}\x{11}",
        'clean' => 1,
        'width' => undef,
        @_ };
    bless $self => $class;
}

for my $param ( qw(width reset clean) ) {
    *{"set_$param"} = sub {
        my $s = S(@_);
        $s->{$param} = shift;
    };

    *{"get_$param"} = sub {
        my $s = S(@_);
        $s->{$param}
    };
}

sub log2vis {
    my $self = S(@_);
    my $str = shift;
    my $width = exists $_[1] ? pop : $self->get_width;
    my $wa = wantarray;
    my $in = encode('UTF-32LE', $str);
    my ($res, $Dir, $L2v, $V2l, $Levels);
    if ( $width and $width < length($str) ) {
        # log2vis doesn't take the display width into account, so we have to 
        # do it ourselves. Get just the embedding levels
        ($Dir, my $emb) = 
            Text::Bidi::private::fribidi_log2vis_get_embedding_levels($in, @_);
        my $i = 0;
        while ( $i < length($str) ) {
            my $vis;
            if ($wa) {
                my ($v2l, $levels, $int);
                ($v2l, $levels, $int, $vis) = 
                    $self->reorder($str, $emb, $Dir, $i, $width);
                push @$V2l, @$v2l;
                push @$Levels, @$levels;
            } else {
                $vis = $self->reorder($str, $emb, $Dir, $i, $width);
            }
            $res .= $vis;
            $i += $width;
        }
        $res = encode('UTF-32LE', $res) if $self->get_clean;
    } else {
        # no width required - may use log2vis from fribidi
        if ($wa) {
            ($res, $Dir, $L2v, $V2l, $Levels) = 
                Text::Bidi::private::log2vis($in, @_);
        } else {
            $res = Text::Bidi::private::log2vis($in, @_);
        }
        $res = decode('UTF-32LE', $res) unless $self->get_clean;
    }
    if ( $self->get_clean ) {
        ($res, $V2l) = 
            Text::Bidi::private::fribidi_remove_bidi_marks($res, $V2l);
        $res = decode('UTF-32LE', $res); 
        if ($V2l) {
            $L2v = $self->invert($V2l);
        }
    } elsif ( $wa and not $L2v ) {
        $L2v = $self->invert($V2l);
    }
    return $wa ? ($res, $Dir, $L2v, $V2l, $Levels) : $res;
}

sub levels2intervals {
    S(@_);
    my $lvl = shift;
    return [] unless @$lvl;
    my @int;
    push @{$int[$_]}, 0 foreach ( 0 .. $lvl->[0] );
    for my $i (0 .. $#$lvl - 1 ) {
        if ($lvl->[$i+1] >= $lvl->[$i]) {
            foreach ( $lvl->[$i] + 1 .. $lvl->[$i+1] ) {
                push @{$int[$_]}, $i+1;
            }
        } else {
            foreach ( $lvl->[$i+1] + 1 .. $lvl->[$i] ) {
                push @{$int[$_]}, $i;
            }
        }
    }
    push @{$int[$_]}, $#$lvl foreach ( 0 .. $lvl->[-1] );
    my $res;
    for my $i ( 0..$#int ) {
        %{$res->[$i]} = @{$int[$i]};
    }
    $res
}

sub _reorder {
    my $self = S(@_);
    my ($level, $from, $to) = @_;
    my @res;
    $level++;
    foreach ( sort { $a <=> $b } keys %{$self->{'_int'}[$level]} ) {
        next if $_ < $from;
        last if $_ > $to;
        my $t = $self->{'_int'}[$level]{$_};
        $t = $to if $t > $to;
        $self->_reorder($level, $_, $t);
        @{$self->{'_chars'}}[$_..$t] = reverse @{$self->{'_chars'}}[$_..$t];
        @{$self->{'_v2l'}}[$_..$t] = reverse @{$self->{'_v2l'}}[$_..$t] 
            if $self->{'_v2l'};
        $from = $t + 1;
    }
}

sub reorder {
    my $self = S(@_);
    my ($log, $levels, $dir, $from, $len) = @_;
    $from = 0 unless $from;
    $len = length($log) - $from if (not $len or $len > length($log) - $from);
    $dir = ($dir == $Text::Bidi::Type::RTL) ? 1 : 0;
    $self->{'_chars'} = [split '', substr($log, $from, $len)];
    $len--;
    my @levels = @$levels[$from .. $from + $len];
    $self->{'_v2l'} = wantarray ? [$from .. ($from+$len)] : undef;
    my $reset = $self->get_reset;
    $reset = qr/[\Q$reset\E]/o;
    # L1
    my $j = -1;
    for my $i ( 0 .. $len ) {
        if ( $self->{'_chars'}[$i] =~ $reset ) {
            $levels[$_] = $dir foreach ($j+1 .. $i);
            $j = $i;
        } elsif ( $self->{'_chars'}[$i] =~ /\S/ ) {
            $j = $i;
        }
    }
    $levels[$_] = $dir foreach ($j+1 .. $len );
    $self->{'_int'} = $self->levels2intervals(\@levels);

    # L2
    $self->_reorder(0, 0, $len);

    # L3, L4 TODO

    my $out = join('', @{$self->{'_chars'}});
    wantarray ? ($self->{'_v2l'}, \@levels, $self->{'_int'}, $out) : $out;
}

sub invert {
    S(@_);
    my @res;
    @res[@{$_[0]}] = 0..$#{$_[0]};
    \@res
}

sub caprtl_to_unicode {
    Text::Bidi::private::fribidi_char_set_enter_cap_rtl();
    decode('UTF-32LE', Text::Bidi::private::fribidi_cap_rtl_to_unicode(@_))
}

sub unicode_to_caprtl {
    Text::Bidi::private::fribidi_char_set_enter_cap_rtl();
    Text::Bidi::private::fribidi_unicode_to_cap_rtl(encode('UTF-32LE', shift));
}

1; # End of Text::Bidi