| Alvis-TermTagger documentation | view source | Contained in the Alvis-TermTagger distribution. |
Alvis::TermTagger - Perl extension for tagging terms in a text
use Alvis::TermTagger;
Alvis::TermTagger::termtagging($text, $termlist, $outputfile);
or
Alvis::TermTagger::termtagging($text, $termlist, $outputfile, $lemmatised_text);
This module tags a text with terms, matching either the inflected or the
lemmatised forms of their words. The text or text corpus
($text) is a file with one sentence per line. The term list
($termlist) is a file containing one term per line. For each term,
additional information (such as a canonical form, a semantic tag and the
lemmatised words of the term) can be given after the first column. These
fields can be separated by either a colon or a vertical
bar. Each line of the output file ($outputfile) contains the
sentence number, the term and the additional information, all separated by a
tab character. The lemmatised text ($lemmatised_text) is
built as the concatenation of the lemmas of the words of the corpus.
This module is mainly used in the Alvis NLP Platform.
termtagging($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename);
This is the main method of the module. It loads the term list
($term_list_filename) and tags the text corpus
($corpus_filename). It produces the list of matching terms, together with
the sentence offset where each term was found and any additional
information given in the input file. The file $output_filename contains
this output. To look up lemmatised terms (a term being the concatenation of
its lemmatised words), the lemmatised corpus $lemmatised_corpus_filename
has to be given as the fourth argument of the method.
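As a minimal sketch of an end-to-end call, the script below prepares a two-sentence corpus and a one-term list in a temporary directory and then calls termtagging() with the three-argument form shown above. The sample sentences, the file names and the write_lines helper are illustrative assumptions, and the call is only made when the module is installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sample data; file names and contents are illustrative only.
my $dir = tempdir(CLEANUP => 1);
my %files = (
    corpus => "$dir/corpus.txt",
    terms  => "$dir/terms.txt",
    output => "$dir/tagged.txt",
);

# Hypothetical helper: write each argument as one line of the file.
sub write_lines {
    my ($path, @lines) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} "$_\n" for @lines;
    close $fh or die "Cannot close $path: $!";
}

# One sentence per line in the corpus, one term per line in the term list.
write_lines($files{corpus}, 'The gene expression level was measured.', 'Nothing to tag here.');
write_lines($files{terms}, 'gene expression');

# Only call the module if it is installed; otherwise just report it.
my $have_module = eval { require Alvis::TermTagger; 1 };
if ($have_module) {
    Alvis::TermTagger::termtagging($files{corpus}, $files{terms}, $files{output});
    print "Matches written to $files{output}\n";
} else {
    print "Alvis::TermTagger is not installed; input files prepared in $dir\n";
}
```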
load_TermList($term_list_filename,\@term_list);
This method loads the term list ($term_list_filename is the file
name) into the array given by reference (\@term_list). Each element
of the term list is a reference to a two-element array (the term and
its canonical form).
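The resulting structure can be pictured with the following sketch. It is not the module's own code: parse_term_lines is a hypothetical helper, the field separator (a colon or a vertical bar) is an assumption, and falling back to the term itself when no canonical form is given is an illustrative choice.

```perl
use strict;
use warnings;

# Hypothetical parser, not the module's own code. Assumes fields are
# separated by ':' or '|' and that the second field is the canonical form.
sub parse_term_lines {
    my (@lines) = @_;
    my @term_list;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        my ($term, $canonical) = split /\s*[:|]\s*/, $line;
        # Assumption: reuse the term when no canonical form is given.
        push @term_list, [ $term, $canonical // $term ];
    }
    return \@term_list;
}

my $terms = parse_term_lines(
    "gene expression|gene expression\n",
    "cell lines:cell line\n",
    "protein\n",
);

# Each element is a reference to a two-element array.
printf "%s => %s\n", @{$_} for @{$terms};
```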
get_Regex_TermList(\@term_list, \@regex_term_list, \@ref_regex_lemmaWordtermlist);
This method generates the regular expressions from the term list
(\@term_list) and stores them in the given array
(\@regex_term_list). \@ref_regex_lemmaWordtermlist records the
regular expressions for the term lemmas.
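How a term might be turned into a regular expression can be sketched as follows. The helper term_to_regex is hypothetical, and the choices it makes (escaping metacharacters, allowing any run of whitespace between words, ignoring case, requiring word boundaries) are assumptions rather than a description of what the module generates.

```perl
use strict;
use warnings;

# Hypothetical helper: turns a term into a case-insensitive regex that
# tolerates any run of whitespace between its words.
sub term_to_regex {
    my ($term) = @_;
    my $pattern = join '\s+', map { quotemeta($_) } split ' ', $term;
    return qr/\b$pattern\b/i;
}

my $re = term_to_regex('IL-2 receptor');

for my $sentence ('The il-2   receptor was blocked.', 'IL2 receptors vary.') {
    printf "%-35s %s\n", $sentence, ($sentence =~ $re ? 'match' : 'no match');
}
```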
load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
This method loads the corpus ($corpus_filename) into a hashtable
(\%corpus) and prepares a lower-case version of the corpus (recorded in a
separate hashtable, \%lc_corpus).
corpus_Indexing(\%lc_corpus, \%corpus_index);
This method indexes the lower-case version of the corpus
(\%lc_corpus) by its words. The index, \%corpus_index, is a
hashtable given by reference.
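A word index of this kind can be sketched in a few lines. The layout used here (word mapped to the set of sentence ids that contain it) and the sample sentences are assumptions for illustration; the module's actual index structure may differ.

```perl
use strict;
use warnings;

# Hypothetical data: sentence id => lower-cased sentence.
my %lc_corpus = (
    1 => 'the gene expression level was measured.',
    2 => 'expression of the protein was low.',
);

# Assumed index layout: word => { sentence id => 1 }.
my %corpus_index;
while (my ($id, $sentence) = each %lc_corpus) {
    $corpus_index{$_}{$id} = 1 for $sentence =~ /\w+/g;
}

# Print each word followed by the ids of the sentences containing it.
for my $word (sort keys %corpus_index) {
    print "$word\t", join(',', sort { $a <=> $b } keys %{ $corpus_index{$word} }), "\n";
}
```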
print_corpus_index(\%corpus_index);
This method prints the corpus index \%corpus_index to STDERR.
term_Selection(\%corpus_index, \@term_list, \%idtrm_select);
This method selects the terms from the term list (\@term_list)
potentially appearing in the corpus (that is the indexed corpus,
\%corpus_index). Results are recorded in the hash table
\%idtrm_select.
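The selection step can be illustrated with the sketch below, which keeps a term only when every one of its words occurs in the index. The selection criterion, the flat index layout and the use of array positions as term ids are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical index (word => 1) and term list ([term, canonical form]).
my %corpus_index = map { $_ => 1 } qw(the gene expression level was measured);
my @term_list = (
    [ 'gene expression',  'gene expression'  ],
    [ 'protein kinase',   'protein kinase'   ],
    [ 'Expression level', 'expression level' ],
);

# Keep the ids (array positions) of terms whose words are all indexed.
my %idtrm_select;
for my $id (0 .. $#term_list) {
    my @words   = split ' ', lc $term_list[$id][0];
    my $missing = grep { !exists $corpus_index{$_} } @words;
    $idtrm_select{$id} = 1 unless $missing;
}

print "$term_list[$_][0]\n" for sort { $a <=> $b } keys %idtrm_select;
```

Filtering the term list this way means that the more expensive regular expression matching only has to run for terms that can occur in the corpus at all.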
term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename);
This method tags the corpus \%corpus with the terms taken from
the term list \@term_list (\@regex_term_list holds the corresponding
regular expressions) and selected in a previous step
(\%idtrm_select). The matching terms are recorded with their
offset and additional information in the file $output_filename.
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results);
or
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results);
This method tags the corpus \%corpus with the terms taken from
the term list \@term_list (\@regex_term_list holds the corresponding
regular expressions) and selected in a previous step
(\%idtrm_select). The matching terms are recorded with their
offset and additional information either in the array @tab_results
(each value holds the sentence id, the selected term and the additional
information, separated by tabs) or in the hashtable %tabh_results (keys
have the form "sentenceid_selectedterm"; each value is an array reference
containing the sentence id, the selected term and the additional information).
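The two result layouts can be compared side by side in the following sketch. It does not call the module; the match values (sentence id 3, the term, and the two extra fields) are hypothetical, and only the shapes of the array and hash entries follow the description above.

```perl
use strict;
use warnings;

# Hypothetical match: sentence id, term, and additional information.
my ($sentence_id, $term, @info) = (3, 'gene expression', 'gene expression', 'BIO');

# Array layout: one tab-separated string per match.
my @tab_results;
push @tab_results, join("\t", $sentence_id, $term, @info);

# Hash layout: key "sentenceid_selectedterm", value an array reference.
my %tabh_results;
$tabh_results{"${sentence_id}_$term"} = [ $sentence_id, $term, @info ];

print "$tab_results[0]\n";
for my $key (sort keys %tabh_results) {
    print "$key => ", join(' | ', @{ $tabh_results{$key} }), "\n";
}
```

The array form is convenient for writing results straight to a file, while the hash form makes it easy to check whether a given term was already found in a given sentence.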
printMatchingTerm($descriptor, $ref_matching_term, $sentence_id);
This method prints to the file descriptor $descriptor the
sentence id ($sentence_id) and the matching term (given by its
reference $ref_matching_term). Both are written on one line,
separated by a tab character.
printMatchingTerm_tab($ref_matching_term, $sentence_id, $ref_tab_results);
This method stores in $ref_tab_results the sentence id
($sentence_id) and the matching term (given by its reference
$ref_matching_term). $ref_tab_results can be an array or a hash
table. For an array, both values are concatenated into one line,
separated by a tab character. For a hash table, both
values are stored in an array, and the hash key is the concatenation of the
sentence id and the matching term.
Alvis web site: http://www.alvis.info
Thierry Hamon <thierry.hamon@univ-paris13.fr>
Copyright (C) 2006 by Thierry Hamon
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.