| KSx-Analysis-StripAccents documentation | Contained in the KSx-Analysis-StripAccents distribution. |
KSx::Analysis::StripAccents - Remove accents and fold to lowercase
0.05 (beta)
my $stripper = KSx::Analysis::StripAccents->new;
my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
analyzers => [ $tokenizer, $stripper, $stemmer ],
);
This analyser strips accents from its input and converts it to lowercase. Because this may change the length of a token, make sure that this analyser is not used before a tokenizer.
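The folding described above can be approximated with core Perl alone. The following is only a minimal sketch, not the module's implementation: the module itself relies on Text::Unaccent, whereas this sketch swaps in Unicode::Normalize (decompose, then drop combining marks), which may not cover every character that Text::Unaccent handles. The helper name fold_token is hypothetical and exists only for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper for illustration; not part of KSx::Analysis::StripAccents.
# Assumption: NFD decomposition plus removal of nonspacing marks stands in
# for what Text::Unaccent does in the real module.
sub fold_token {
    my ($text) = @_;
    my $decomposed = NFD($text);      # split base letters from combining marks
    $decomposed =~ s/\p{Mn}+//g;      # drop the combining marks (the accents)
    return lc uc NFC($decomposed);    # uc first, so that final sigma folds to the same letter
}

binmode STDOUT, ':encoding(UTF-8)';
print fold_token($_), "\n" for 'Café', 'ÉCOLE', 'Σσς';
# prints:
#   cafe
#   ecole
#   σσσ
```

Uppercasing before lowercasing is what makes the last example come out as three identical letters; a plain lc would leave the final sigma as it is.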
Construct a new accent-stripping analyser.
This module requires perl and the following modules, which you can get from the CPAN:
KinoSearch 0.2 or later
Copyright (C) Father Chrysostomos
This program is free software; you may redistribute or modify it (or both) under the same terms as perl.
KinoSearch::Analysis::Analyzer (the base class)
KinoSearch::Analysis::LCNormalizer (which this module was based on, and is intended as a drop-in replacement for)
KinoSearch::Analysis::CaseFolder (the name LCNormalizer has been given in the dev branch of KinoSearch)
use strict;
use warnings;

package KSx::Analysis::StripAccents;
use base qw( KinoSearch::Analysis::Analyzer );

our $VERSION = '0.05';

use Encode qw 'encode decode';
use Text::Unaccent 'unac_string_utf16';

sub analyze_batch {
    my ( $self, $batch ) = @_;

    # lc and unaccent all of the terms, one by one
    while ( my $token = $batch->next ) {
        # I have to use UTF-16BE, since, although it’s not documented,
        # Text::Unaccent only supports big-endian. And I have to encode it,
        # since it doesn’t support Perl’s Unicode strings. (And it’ll
        # convert it to UTF-16 behind the scenes anyway, if I don’t.)
        $token->set_text(
            lc uc decode 'utf-16be',
                unac_string_utf16 encode 'UTF-16BE', $token->get_text
        );
        # We have an ‘lc uc’ there, since some letters won’t be normalised
        # properly without it; e.g., ‘Σσς’ should be normalised to three
        # instances of the same character (‘σσσ’ as opposed to ‘σσς’).
    }

    $batch->reset;
    return $batch;
}

*transform = *analyze_batch;

1;

__END__