Encode::Repair - Repair wrongly encoded text strings


Encode-Repair documentation Contained in the Encode-Repair distribution.

Index


Code Index:

NAME

Top

Encode::Repair - Repair wrongly encoded text strings

SYNOPSIS

Top

    # Simple usage
    use Encode::Repair qw(repair_double);
    binmode STDOUT, ':encoding(UTF-8)';

    # prints: small ae: ä
    print repair_double("small ae: \xc3\x83\xc2\xa4\n");

    # prints: beta: β
    print repair_double("beta: \xc4\xaa\xc2\xb2\n", {via => 'Latin-7'});




    # Advanced usage
    # assumes you have a sample text both correctly decoded in a
    # character string, and as a wrongly encoded buffer

    use Encode::Repair qw(repair_encoding learn_recoding);
    use charnames qw(:full);
    binmode STDOUT, ':encoding(UTF-8)';

    my $recoding_pattern  = learn_recoding(
        from        => "beta: \xc4\xaa\xc2\xb2",
        to          => "beta: \N{GREEK SMALL LETTER BETA}",
        encodings   => ['UTF-8', 'Latin-1', 'Latin-7'],
    );
    if ($recoding_pattern) {
        my $mojibake = "\304\252\302\273\304\252\302\261\304\252\302"
                    ."\274\304\252\342\200\234\304\252\302\261";
        print repair_encoding($mojibake, $recoding_pattern), "\n";
    } else {
        print "Sorry, could not help you :-(\n";
    }




DESCRIPTION

Top

Sometimes software or humans mess up the character encoding of text. In some cases it is possible to reconstruct the original text. This module helps you to do it.

It covers the rather common case that a program assumes a wrong character encoding on reading some input, and converts it to Mojibake (see http://en.wikipedia.org/wiki/Mojibake).

If you use this module on a regular basis, it most likely indicates that something is wrong in your processs. It should only be used for one-time tasks such as migrating a database to a new system.

FUNCTIONS

Top

repair_double

Repairs the common case when a UTF-8 string was read as another encoding, and was encoded as UTF-8 again. The other encoding defaults to ISO-8859-1 aka Latin-1, and can be overridden with the via option:

    my $repaired = repair_double($buffer, {via => 'ISO-8859-2' });

It expects an octet string as input, and returns a decoded character string.

learn_recoding

Given a sample of text twice, once correctly decoded and once mistreated, attemps to find a sequence of encoding and decoding that turns the mistreated text into the correct form.

    my $coding_pattern = learn_recoding(
        from        => $mistreated_buffer,
        to          => $correct_string,
        encodings   => \@involved_encodings,
        depth       => 5,
        search      => 'first',
    );

encodings should be an array reference containing all the character encodings involved in the process that messes up the encoding. If you don't know these, try it with UTF-8, ISO-8859-1 and the encoding that your system uses by default.

depth is the maximal number of encoding and decoding steps to be tried. For example repair_double needs three steps. Defaults to 5; higher values might slow down the program significantly, although smaller depths are tried first.

The return value is undef on failure, and an array reference otherwise. It returns the encoding/decoding steps suitable for feeding into repair_encoding. It contains a list of even size, where elements with even indexes are either 'encode' or 'decode', and those with odd indexes contain the name of the encoding.

With search you can adjust how long the function searches for a recoding sequence. WIth the default of 'first' it returns the first possible sequence. With 'shallow' it searches for the first working sequence and all other sequences of the same length, and then returns an array reference containing array references to all sequences. With the value 'all', all possible sequences are searched and returned, but often that's a very bad idea, because it also finds sequences where parts of the sequence undo the work of other sequences (something like [qw(encode latin-1 decode latin-1)]).

Since Version 0.0.2 learn_recoding forces strict pattern of alternatining encoding and decoding. So even if ['decode', 'UTF-8', 'decode', 'UTF-8'] is a working input, learn_recoding will return ['decode', 'UTF-8', 'encode', 'Latin-1', 'decode', 'UTF-8'] instead. So you might have to include Latin-1 in your encoding list even if it is not strictly involved.

repair_encoding

Takes an input string and an encoding/decoding pattern (as returned from learn_recoding) as input and returns the repaired string.

Troubleshooting

Top

If learn_recoding returns undef, you can increase the depth option value (for example to 7). If that doesn't help, check that the two input strings actually corespond. learn_recoding does an exact equality check, so trailing newline characters or spaces will cause it to fail.

If repair_encoding produces errors or warnings, it is likely that the sample you used for learning was not long enough, or not representative. For example if your system uses both ISO-8859-1 and ISO-8859-15 (which are quite similar), learn_recoding uses the first match, so the sample data has to contain at least one character that's in ISO-8859-15 but not in ISO-8859-1, like the Euro sign (€).

Further Reading

Top

This document tries to stick to the terminology introduced in the Encode module.

If you want to learn more about the way text is encoded and how perl handles that, take a look at http://perlgeek.de/en/article/encodings-and-unicode.

LICENSE AND COPYRIGHT

Top

Development

Top

The source code is stored in a public git repository at http://github.com/moritz/Encode-Repair. If you find any bugs, please used the issue tracker linked from this site.

If you find a case of messed-up encodings that can be repaired deterministically and that's not covered by this module, please contact the author, providing a hex dump of both input and output, and as much information of the encoding and decoding process as you have.

Patches are also very welcome.


Encode-Repair documentation Contained in the Encode-Repair distribution.

package Encode::Repair;
our $VERSION = '0.0.2';
use strict;
use warnings;

our @EXPORT_OK = qw(repair_double learn_recoding repair_encoding);
use Exporter qw(import);
use Encode qw(encode decode);
use Algorithm::Loops qw(NestedLoops MapCar);

# since Algorithm::Loops already provides MapCar, it is very easy to implement
# zip() with it, instead of introducing another dependency (on
# List::MoreUtils, specifically)
sub zip {
    MapCar {  @_ == 2 ? @_ : () } @_;
}

my %subs = (
    encode  => \&encode,
    decode  => \&decode,
);

sub repair_encoding {
    my ($str, $actions) = @_;
    for (my $i = 0; $i < @$actions; $i += 2) {
        my $type     = $actions->[$i];
        my $encoding = $actions->[$i+1];
        no warnings 'utf8';
        $str = $subs{$type}->($encoding, $str);
    }
    $str;
}

sub repair_double {
    my ($buf, $options) = @_;
    my $via = 'ISO-8859-1';
    $via = $options->{via} if $options && exists $options->{via};
    repair_encoding($buf, [
            'decode', 'UTF-8',
            'encode', $via,
            'decode', 'UTF-8',
    ]);
}

sub learn_recoding {
    my %args        = @_;
    my $source      = $args{from};
    my $target      = $args{to};
    my $encodings   = $args{encodings};
    my $maxdepth    = $args{depth} || 5;
    my $search_mode = $args{search} || 'first';
    return [] if $source eq $target;

    my @result;
    for my $depth (1..$maxdepth) {
        my $iter = NestedLoops( [($encodings) x $depth] );
        my @ed   =  (qw(encode decode)) x (int($depth / 2) + 1);
        my @de   =  (qw(decode encode)) x (int($depth / 2) + 1);
        while (my @steps = $iter->()) {
            no warnings 'uninitialized';
            for my $steps ([zip \@ed, \@steps], [zip \@de, \@steps]) {
#                use Data::Dumper;
#                warn Dumper($steps);
                if (eval {repair_encoding($source, $steps)} eq $target) {
                    if (lc($search_mode) eq 'first') {
                        return $steps;
                    } else {
                        push @result, $steps;
                    }
                }
            }
        }
        return \@result if @result && lc($search_mode) eq 'shallow';
    }
    return \@result if @result;
    return;
}

1;