Text::Normalize::NACO - Normalize text based on the NACO rules


Text-Normalize-NACO documentation Contained in the Text-Normalize-NACO distribution.

NAME

Text::Normalize::NACO - Normalize text based on the NACO rules

SYNOPSIS

    # exported function
    use Text::Normalize::NACO qw( naco_normalize );

    $normalized = naco_normalize( $original );

    # as an object
    $naco       = Text::Normalize::NACO->new;
    $normalized = $naco->normalize( $original );

    # normalize to lowercase
    $naco->case( 'lower' );
    $normalized = $naco->normalize( $original );

DESCRIPTION

In general, normalization is defined as:

    To make (a text or language) regular and consistent, especially with respect to spelling or style.

It is commonly used for comparative purposes. These particular normalization rules have been set out by the Name Authority Cooperative. The rules are described in detail at: http://www.loc.gov/catdir/pcc/naco/normrule.html
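
To illustrate the comparative use, the sketch below groups variant spellings of a heading under one shared key. It does not call this module: `rough_key` is a hypothetical, core-Perl stand-in that handles only ASCII punctuation, case, and whitespace, and it skips the diacritic handling that the NACO rules require.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a NACO-style normalizer (ASCII input only).
# It is NOT the module's implementation and it ignores diacritics.
sub rough_key {
    my ( $text ) = @_;

    $text =~ s/[!(){}<>;:.?,\/\\\@*%=\$^_~-]/ /g;   # punctuation becomes a space
    $text =~ s/['\[\]|]//g;                         # apostrophes, brackets, pipes are dropped
    $text = uc $text;                               # fold to upper case
    $text =~ s/^\s+|\s+$//g;                        # trim both ends
    $text =~ s/\s+/ /g;                             # collapse runs of whitespace

    return $text;
}

# Group variant spellings under the key they share.
my %variants_by_key;
for my $heading ( "O'Brien, Flann", 'OBRIEN  FLANN', ' obrien, flann. ' ) {
    push @{ $variants_by_key{ rough_key( $heading ) } }, $heading;
}

for my $key ( sort keys %variants_by_key ) {
    printf "%s <= %d variant(s)\n", $key, scalar @{ $variants_by_key{ $key } };
}
```

All three sample headings end up under the single key `OBRIEN FLANN`, which is what makes normalized forms usable for matching.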

INSTALLATION

    perl Makefile.PL
    make
    make test
    make install

METHODS

new( %options )

Creates a new Text::Normalize::NACO object. You can explicitly request that strings be normalized to upper or lower case by setting the "case" option (defaults to "upper").

    my $naco = Text::Normalize::NACO->new( case => 'lower' );

case( $case )

Accessor/mutator for the case ("upper" or "lower") in which the normalized string is returned. Called with no argument, it returns the current setting.

    # lower-case
    $naco->case( 'lower' );

    # upper-case
    $naco->case( 'upper' );
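
The combined accessor/mutator pattern can be sketched with a stand-in class so that it runs without the module installed. `Demo::CaseHolder` and its internal `_case` key are hypothetical names chosen for illustration; only the shape of new() and case() is meant to resemble the interface described here.

```perl
use strict;
use warnings;

# Hypothetical stand-in class illustrating a combined getter/setter.
package Demo::CaseHolder;

sub new {
    my ( $class, %options ) = @_;
    my $self = bless {}, $class;
    $self->case( $options{ case } || 'upper' );   # fall back to 'upper' when unset
    return $self;
}

sub case {
    my $self = shift;
    $self->{ _case } = shift if @_;   # act as a setter only when given a value
    return $self->{ _case };          # always return the current setting
}

package main;

my $holder = Demo::CaseHolder->new;
print $holder->case, "\n";            # read the default

$holder->case( 'lower' );             # change it
print $holder->case, "\n";            # read it back
```

Testing `@_` rather than the truth of the argument means the setter runs whenever a value is passed, even a false one.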

naco_normalize( $text, { %options } )

Function version of normalize, exported on request (it is not exported by default). You can specify options by passing a hashref after the string to be normalized.

    my $normalized = naco_normalize( $original, { case => 'lower' } );

normalize( $text )

Normalizes $text and returns the new string.

    my $normalized = $naco->normalize( $original );
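
The stages applied during normalization can be traced on a sample string. The sketch below is a core-Perl approximation for ASCII input only: `trace_stages` is a hypothetical helper, the character lists are assumptions modeled on the punctuation handling described in the NACO rules, and the transliteration of diacritics is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [ label, text ] pairs, one per stage, in order.
# ASCII-only approximation; transliteration of diacritics is omitted.
sub trace_stages {
    my ( $text ) = @_;
    my @trace;

    $text =~ s/[!(){}<>;:.?,\/\\\@*%=\$^_~-]/ /g;
    push @trace, [ 'punctuation to spaces', $text ];

    $text =~ s/['\[\]|]//g;
    push @trace, [ 'delete quotes/brackets', $text ];

    $text = uc $text;
    push @trace, [ 'fold to upper case', $text ];

    $text =~ s/^\s+|\s+$//g;
    push @trace, [ 'trim ends', $text ];

    $text =~ s/\s+/ /g;
    push @trace, [ 'condense spaces', $text ];

    return @trace;
}

# Print each stage between bars so leading and trailing spaces stay visible.
for my $stage ( trace_stages( q{  Smith-Jones, A. [ed.]  } ) ) {
    printf "%-24s |%s|\n", @{ $stage };
}
```

Trimming before condensing, as here, leaves no stray space at either end; the final stage for the sample input is `SMITH JONES A ED`.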

SEE ALSO

* http://www.loc.gov/catdir/pcc/naco/normrule.html

AUTHOR

Brian Cassidy <bricas@cpan.org>

COPYRIGHT AND LICENSE

package Text::Normalize::NACO;

use base qw( Exporter );

use strict;
use warnings;

use Text::Unidecode;

our $VERSION = '0.13';

our @EXPORT_OK = qw( naco_normalize );

sub new {
    my $class   = shift;
    my %options = @_;
    my $self    = bless {}, $class;

    $self->case( $options{ case } || 'upper' );

    return $self;
}

sub case {
    my $self = shift;
    my ( $case ) = @_;

    $self->{ _CASE } = $case if @_;

    return $self->{ _CASE };
}

sub naco_normalize {
    my $text    = shift;
    my $options = shift;
    my $case    = $options->{ case } || 'upper';

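    # There is no object in the functional interface, so pass undef as $self;
    # normalize() then skips case folding and it is applied below instead.
    my $normalized = normalize( undef, $text );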

    if ( $case eq 'lower' ) {
        $normalized =~ tr/A-Z/a-z/;
    }
    else {
        $normalized =~ tr/a-z/A-Z/;
    }

    return $normalized;
}

sub normalize {
    my $self = shift;
    my $data = shift;

    # Rules taken from NACO Normalization
    # http://lcweb.loc.gov/catdir/pcc/naco/normrule.html

    # Remove diacritical marks and convert special chars
    # (unidecode alters its argument in place when called in void context)
    unidecode( $data );

    # Convert special chars to spaces
    $data =~ s/[\Q!(){}<>-;:.?,\/\\@*%=\$^_~\E]/ /g;

    # Delete special chars
    $data =~ s/[\Q'[]|\E]//g;

    # Convert lowercase to uppercase or vice-versa. When called as a plain
    # function (no object), the caller is responsible for case folding.
    if ( $self ) {
        if ( $self->case eq 'lower' ) {
            $data =~ tr/A-Z/a-z/;
        }
        else {
            $data =~ tr/a-z/A-Z/;
        }
    }

    # Remove leading and trailing spaces
    $data =~ s/^\s+|\s+$//g;

    # Condense multiple spaces
    $data =~ s/\s+/ /g;

    return $data;
}

1;