| String-Canonical documentation | Contained in the String-Canonical distribution. |
String::Canonical - Creates canonical strings.
use String::Canonical qw/cstr/;
print cstr("one thousand maniacs");
print String::Canonical::get("Second tier");
This module generates a canonical string by converting roman numerals to digits, English descriptions of numbers to digits, stripping off all accents on characters (as well as handling oe = ö, ae = æ, etc.), replacing words with symbols (e.g. and = &, plus = +, etc.) and removing common variant endings.
In short, this module generates the same signature for the following strings:
bjørk = björk = bjoerk = bjork
1,000 maniacs = one thousand maniacs = 1k maniacs
Boyz II Men = Boyz To Men = Boyz 2 Men
ACDC = AC/DC = AC-DC
Rubin and company = Rubin & Company = Rubin & Co.
Third Eye Blind = 3rd eye blind
Train runnin' = Train Running
The following functions may be imported into the caller package by name:
Returns the canonical form of the string passed. If no string is passed, $_ is used. When called in void context the function will set $_. The functon may also be accessed as get but only cstr may be exported.
Compares two strings. Note that if the second string is not provided, $_ is used.
Erick Calder <ecalder@cpan.org>
For help and thank you notes, e-mail the author directly. To report a bug, submit a patch or add to our wishlist please visit the CPAN bug manager at: http://rt.cpan.org
The latest version of the tarball, RPM and SRPM may always be found at: http://perl.arix.com/ Additionally the module is available from CPAN.
This utility is free and distributed under GPL, the Gnu Public License. A copy of this license was included in a file called LICENSE. If for some reason, this file was not included, please see http://www.gnu.org/licenses/ to obtain a copy of this license.
$Id: Canonical.pm,v 1.2 2003/02/15 01:44:39 ekkis Exp $
| String-Canonical documentation | Contained in the String-Canonical distribution. |
#!/usr/bin/perl
# --- prologue ---------------------------------------------------------------- package String::Canonical; require 5.000; use warnings; use strict; use Exporter; use Lingua::EN::Numericalize; # interpret English use Text::Roman qw/roman2int/; # interpret Roman numbers use vars qw/$VERSION @ISA @EXPORT_OK/; $VERSION = substr q$Revision: 1.2 $, 10; @ISA = qw/Exporter/; @EXPORT_OK = qw/&cstr &cstr_cmp/; my @dx; # deletions my %yx; # transliterations my %sx; # replacements # --- module interface --------------------------------------------------------
sub get { &cstr; } sub cstr { my $s = lc(shift || $_) || return; local $_ if defined wantarray(); $s =~ s/\Q$_\E/$sx{$_}/gi for keys %sx; eval "\$s =~ y/$_/$yx{$_}/" for keys %yx; $s =~ s/\Q$_\E//g for @dx; ($_, $s) = (str2nbr($s), ""); $s .= roman2int() || $_ for split; $s =~ s/[_\W]//g; $_ = $s; }
sub cmp { &cstr_cmp; } sub cstr_cmp { my $s1 = shift; my $s2 = shift || $_; cstr($s1) eq cstr($s2); } # --- internal structures ----------------------------------------------------- @dx = qw/the da/; %sx = ( "company" => "co", "brother" => "bro", "to" => 2, "for" => 4, "mister" => "mr", "senior" => "sr", "o'" => "of", "ol'" => "old", "in'" => "ing", "oe" => "o", "ae" => "a", "@" => "at", "&" => "and", "'n" => "and", " n'" => "and", "'n'" => "and", "#" => "no", "nbr" => "no", "number" => "no", "%" => "pct", "percent" => "pct", "volume" => "vol", "ß" => "ss", "+" => "plus", ); %yx = ( "äÄàÀáÁåÅâÂãÃ" => "a", "ëËèÈéÉêÊ" => "e", "ïÏìÌíÍîÎ" => "i", "öÖòÒóÓôÔõÕ" => "o", "üÜùÙúÚûÛ" => "u", "æÆøØçÇñÑðÐþÞýÝÿÿ" => "aaooccnnddddyyyy", );