Search::Tools::UTF8 - UTF8 string wrangling


Search-Tools documentation Contained in the Search-Tools distribution.

Index


Code Index:

NAME

Top

Search::Tools::UTF8 - UTF8 string wrangling

SYNOPSIS

Top

 use Search::Tools::UTF8;

 my $str = 'foo bar baz';

 print "bad UTF-8 sequence: " . find_bad_utf8($str)
    unless is_valid_utf8($str);

 print "bad ascii byte at position " . find_bad_ascii($str)
    unless is_ascii($str);

 print "bad latin1 byte at position " . find_bad_latin1($str)
    unless is_latin1($str);

DESCRIPTION

Top

Search::Tools::UTF8 supplies common UTF8-related functions.

FUNCTIONS

Top

byte_length( text )

Returns the number of bytes in text regardless of encoding.

is_valid_utf8( text )

Returns true if text is a valid sequence of UTF-8 bytes, regardless of how Perl has it flagged (is_utf8 or not).

is_ascii( text )

If text contains no bytes above 127, then returns true (1). Otherwise, returns false (0). Used by convert() internally to check text prior to transliterating.

is_latin1( text )

Returns true if text lies within the Latin1 charset.

NOTE: Only Latin1 octets with a valid representable character are checked. Octets in the range \x80 - \x9f are not considered valid Latin1 and if found in text, is_latin1() will return false.

CAUTION: A string of bytes can be both valid Latin1 and valid UTF-8, even though the string doesn't represent the same Unicode codepoint(s). Example:

 my $str = "\x{d9}\x{a6}";  # same as \x{666}
 is_valid_utf8($str);       # returns true
 is_latin1($str);           # returns true

Thus is_latin1() (and likewise find_bad_latin1()) are not foolproof. Use them in combination with is_flagged_utf8() to get a better test.

is_flagged_utf8( text )

Returns true if Perl thinks text is UTF-8. Same as Encode::is_utf8().

is_perl_utf8_string( text )

Wrapper around the native Perl is_utf8_string() function. Called by is_valid_utf8().

is_sane_utf8( text [,warnings] )

Will test for double-y encoded text. Returns true if text looks ok. From Text::utf8 docs:

 Strings that are not utf8 always automatically pass.

Pass a second true param to get diagnostics on stderr.

find_bad_utf8( text )

Returns string of bad bytes from text. This of course assumes that text is not valid UTF-8, so use it like:

 croak "bad bytes: " . find_bad_utf8($str) 
    unless is_valid_utf8($str);

If text is a valid UTF-8 string, returns undef.

find_bad_ascii( text )

Returns position of first non-ASCII byte or -1 if text is all ASCII.

find_bad_latin1( text )

Returns position of first non-Latin1 byte or -1 if text is valid Latin1.

find_bad_latin1_report( text )

Returns position of first non-Latin1 byte (like find_bad_latin1()) and also carps about what the decimal and hex values of the bad byte are.

to_utf8( text, charset )

Shorthand for running text through appropriate is_*() checks and then converting to UTF-8 if necessary. Returns text encoded and flagged as UTF-8.

Returns undef if for some reason the encoding failed or the result did not pass is_sane_utf8().

looks_like_cp1252( text )

This function tests that there are bytes in text between 0x80 and 0x9f inclusive. Those bytes are used by the Windows-1252 character set and include some of the troublesome characters like curly quotes.

See also fix_cp1252_codepoints_in_utf8() and the Search::Tools::Transliterate convert1252() method.

fix_cp1252_codepoints_in_utf8( text )

The Windows-1252 codepoints between 0x80 and 0x9f may be encoded validly as UTF-8 but the Unicode standard does not map any characters at those codepoints. fix_cp1252_codepoints_in_utf8() converts a UTF-8 encoded string text to map the suspect 1252 codepoints to their correct Unicode representations.

Note that fix_cp1252_codepoints_in_utf8() is different from the fix_latin() function used in Transliterate, which does not differentiate between a Windows-1252 encoded string and a UTF-8 encoded string.

This function will croak if text does not pass is_valid_utf8().

debug_bytes( text )

Iterates over each byte in text, printing byte, hex and decimal values to stderr.

AUTHOR

Top

Peter Karman <karman@cpan.org>

Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of some of these modules.

Many of the UTF-8 tests come directly from Test::utf8.

BUGS

Top

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

Top

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools




You can also look for information at:

* RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tools

* AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Search-Tools

* CPAN Ratings

http://cpanratings.perl.org/d/Search-Tools

* Search CPAN

http://search.cpan.org/dist/Search-Tools/

COPYRIGHT

Top

SEE ALSO

Top

HTML::HiLiter, SWISH::HiLiter, Rose::Object, Class::XSAccessor, Text::Aspell


Search-Tools documentation Contained in the Search-Tools distribution.

package Search::Tools::UTF8;
use strict;
use warnings;
use Carp;
use Search::Tools;    # XS stuff
use Encode;
use charnames ':full';
use Data::Dump qw( dump );
use base qw( Exporter );
our @EXPORT = qw(
    to_utf8
    is_valid_utf8
    is_flagged_utf8
    is_perl_utf8_string
    is_ascii
    is_latin1
    is_sane_utf8
    find_bad_utf8
    find_bad_ascii
    find_bad_latin1
    find_bad_latin1_report
    byte_length
    looks_like_cp1252
    fix_cp1252_codepoints_in_utf8
    debug_bytes
);

our $Debug = ( $ENV{PERL_DEBUG} && $ENV{PERL_DEBUG} > 2 ) ? 1 : 0;

our $VERSION = '0.59';

sub to_utf8 {
    my $str = shift;
    my $charset = shift || 'iso-8859-1';

    # checks first
    if ( is_flagged_utf8($str) ) {
        $Debug and carp "string '$str' is flagged utf8 already";
        return $str;
    }
    if ( is_valid_utf8($str) ) {
        my $newstr = Encode::decode_utf8( $str, 1 );
        $Debug and carp "string '$str' is valid utf8; utf8 flag turned on";
        return $newstr;
    }
    if ( is_ascii($str) ) {
        Encode::_utf8_on($str);
        $Debug and carp "string '$str' is ascii; utf8 flag turned on";
        return $str;
    }

    $Debug
        and carp "converting $str from $charset -> utf8";
    my $c = Encode::decode( $charset, $str );
    $Debug and carp "converted $c";

    unless ( is_sane_utf8( $c, 1 ) ) {
        carp "not sane: $c";
    }

    return $c;
}

sub is_flagged_utf8 {
    return Encode::is_utf8( $_[0] );
}

my $re_bit = join "|",
    map { Encode::encode( "utf8", chr($_) ) } ( 127 .. 255 );

#binmode STDERR, ":utf8";
#print STDERR $re_bit;

sub is_sane_utf8 {
    my $string = shift;
    my $warnings = shift || $Debug || 0;

    while ( $string =~ /($re_bit)/o ) {

        # work out what the double encoded string was
        my $bytes = $1;

        my $index = $+[0] - length($bytes);
        my $codes = join '', map { sprintf '<%00x>', ord($_) } split //,
            $bytes;

        # what character does that represent?
        my $char = Encode::decode( "utf8", $bytes );
        my $ord  = ord($char);
        my $hex  = sprintf '%00x', $ord;
        $char = charnames::viacode($ord);

        # print out diagnostic messages
        if ($warnings) {

            warn(qq{Found dodgy chars "$codes" at char $index\n});
            if ( Encode::is_utf8($string) ) {
                warn("Chars in utf8 string look like utf8 byte sequence.");
            }
            else {
                warn("String not flagged as utf8...was it meant to be?\n");
            }
            warn(
                "Probably originally a $char char - codepoint $ord (dec), $hex (hex)\n"
            );

        }
        return 0;
    }
    1;
}

sub is_valid_utf8 {
    if (   is_latin1( $_[0] )
        && !is_ascii( $_[0] )
        && !is_perl_utf8_string( $_[0] ) )
    {
        return 0;
    }
    return is_perl_utf8_string( $_[0] );
}

sub find_bad_latin1_report {
    my $bad = find_bad_latin1(@_);
    if ($bad) {

        # explain why we failed
        my $char = substr( $_[0], $bad, 1 );
        my $dec  = ord($char);
        my $hex  = sprintf '%x', $dec;
        carp("byte $bad ($char) is not Latin1 (it's $dec dec / $hex hex)");
    }
    return $bad;
}

sub looks_like_cp1252 {
    if (   !is_latin1( $_[0] )
        && !is_ascii( $_[0] )
        && $_[0] =~ m/[\x80-\x9f]/ )
    {
        return 1;
    }
    return 0;
}

my %win1252 = (
    "\x80" => "\x{20AC}",    #EURO SIGN
    "\x81" => '',            #UNDEFINED
    "\x82" => "\x{201A}",    #SINGLE LOW-9 QUOTATION MARK
    "\x83" => "\x{0192}",    #LATIN SMALL LETTER F WITH HOOK
    "\x84" => "\x{201E}",    #DOUBLE LOW-9 QUOTATION MARK
    "\x85" => "\x{2026}",    #HORIZONTAL ELLIPSIS
    "\x86" => "\x{2020}",    #DAGGER
    "\x87" => "\x{2021}",    #DOUBLE DAGGER
    "\x88" => "\x{02C6}",    #MODIFIER LETTER CIRCUMFLEX ACCENT
    "\x89" => "\x{2030}",    #PER MILLE SIGN
    "\x8A" => "\x{0160}",    #LATIN CAPITAL LETTER S WITH CARON
    "\x8B" => "\x{2039}",    #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    "\x8C" => "\x{0152}",    #LATIN CAPITAL LIGATURE OE
    "\x8D" => '',            #UNDEFINED
    "\x8E" => "\x{017D}",    #LATIN CAPITAL LETTER Z WITH CARON
    "\x8F" => '',            #UNDEFINED
    "\x90" => '',            #UNDEFINED
    "\x91" => "\x{2018}",    #LEFT SINGLE QUOTATION MARK
    "\x92" => "\x{2019}",    #RIGHT SINGLE QUOTATION MARK
    "\x93" => "\x{201C}",    #LEFT DOUBLE QUOTATION MARK
    "\x94" => "\x{201D}",    #RIGHT DOUBLE QUOTATION MARK
    "\x95" => "\x{2022}",    #BULLET
    "\x96" => "\x{2013}",    #EN DASH
    "\x97" => "\x{2014}",    #EM DASH
    "\x98" => "\x{02DC}",    #SMALL TILDE
    "\x99" => "\x{2122}",    #TRADE MARK SIGN
    "\x9A" => "\x{0161}",    #LATIN SMALL LETTER S WITH CARON
    "\x9B" => "\x{203A}",    #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    "\x9C" => "\x{0153}",    #LATIN SMALL LIGATURE OE
    "\x9D" => '',            #UNDEFINED
    "\x9E" => "\x{017E}",    #LATIN SMALL LETTER Z WITH CARON
    "\x9F" => "\x{0178}",    #LATIN CAPITAL LETTER Y WITH DIAERESIS

);

# fix_latin (used in Transliterate) lacks the check for the
# prefixed \xc2 byte, but the UTF-8 encoding for these
# Windows codepoints has the leading \xc2 byte.
sub fix_cp1252_codepoints_in_utf8 {
    my $buf = shift;
    unless ( is_valid_utf8($buf) ) {
        my $badbyte = find_bad_utf8($buf);
        croak "bad UTF-8 byte(s) at $badbyte [ " . dump($buf) . " ]";
    }
    $Debug and warn "converting $buf\n";
    $buf =~ s/\xc2([\x80-\x9f])/$win1252{$1}/g;
    return $buf;
}

1;

__END__