| Lingua-EN-Segmenter documentation | view source | Contained in the Lingua-EN-Segmenter distribution. |
Lingua::EN::Segmenter::TextTiling - Segment text using the TextTiling method
use lib '.';
use Lingua::EN::Segmenter::TextTiling qw(segments);
use Lingua::EN::Splitter;

my $text = <<EOT;
Lingua::EN::Segmenter is a useful module that allows text to be split up
into words, paragraphs, segments, and tiles.

Paragraphs are by default indicated by blank lines. Known segment breaks are
indicated by a line with only the word "segment_break" in it.

The module detects paragraphs that are unrelated to each other by comparing
the number of words per-paragraph that are related. The algorithm is designed
to work only on long segments.

SOUTH OF BAGHDAD, Iraq (CNN) -- Seven U.S. troops freed Sunday after being
held by Iraqi forces arrived by helicopter at a base south of Baghdad and were
transferred to a C-130 transport plane headed for Kuwait, CNN's Bob Franken
reported from the scene.
EOT

my $num_segment_breaks = 1;
my @segments = segments($num_segment_breaks, $text);

print $segments[0];
# Prints the first three paragraphs of the above text

print "\n----------SEGMENT_BREAK----------\n";

print $segments[1];
# Prints the last paragraph of the above text

# This module can also be used in an object-oriented fashion
my $splitter = Lingua::EN::Splitter->new;
my @words = $splitter->words($text);
See synopsis.
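The synopsis says the module finds unrelated paragraphs by comparing how many related words they contain. The following is a minimal sketch of that general idea in plain Perl, not the module's actual implementation: it scores each gap between adjacent paragraphs by the cosine similarity of their word counts and picks the lowest-scoring gaps as breaks. The function names (paragraphs, word_counts, cosine, weakest_gaps) are hypothetical, and the sketch assumes no stop-word removal, stemming, or smoothing, all of which a real TextTiling implementation would normally include.

```perl
use strict;
use warnings;

# Split text into paragraphs on blank lines, dropping empty pieces.
sub paragraphs {
    my ($text) = @_;
    return grep { /\S/ } split /\n\s*\n/, $text;
}

# Count lowercased word occurrences in one paragraph.
sub word_counts {
    my ($paragraph) = @_;
    my %count;
    $count{ lc $1 }++ while $paragraph =~ /([A-Za-z']+)/g;
    return \%count;
}

# Cosine similarity of two word-count hashes (0 means no shared words).
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return ($nx && $ny) ? $dot / sqrt($nx * $ny) : 0;
}

# Return the indices of the $n least similar gaps, in document order.
# Gap $i lies between paragraph $i and paragraph $i + 1 (zero-based).
sub weakest_gaps {
    my ($n, @paras) = @_;
    my @counts = map { word_counts($_) } @paras;
    my @score  = map { cosine($counts[$_], $counts[$_ + 1]) } 0 .. $#counts - 1;
    my @order  = sort { $score[$a] <=> $score[$b] || $a <=> $b } 0 .. $#score;
    my $last   = $n > @order ? $#order : $n - 1;
    my @picked = sort { $a <=> $b } @order[0 .. $last];
    return @picked;
}

# Hypothetical sample text: two paragraphs about a cat, one about stocks.
my $sample = join "\n\n",
    "The cat sat on the mat.",
    "The cat chased the mouse off the mat.",
    "Stock prices fell sharply on Monday.";

my @paras = paragraphs($sample);
my @gaps  = weakest_gaps(1, @paras);
print "segment break in gap $_\n" for @gaps;
```

In this sample the second and third paragraphs share no words, so the gap between them scores zero and is chosen as the break. With paragraphs this short the scores are noisy, which is in line with the synopsis's remark that the algorithm is designed to work only on long segments.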
This module is designed to be easily extensible. Feel free to subclass it when designing alternative methods for text segmentation.
David James <splice@cpan.org>
Copyright (c) 2002 David James. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.