| Text-Identify-BoilerPlate documentation | view source | Contained in the Text-Identify-BoilerPlate distribution. |
Text::Identify::BoilerPlate - Remove repeated text
Version 0.3.1
Finds boilerplate text (lines that are repeated across documents) in a list of plain text files.
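use Text::Identify::BoilerPlate;
# Plain text files to scan for repeated lines.
my @files = ('file1', 'file2', 'file3');
# A line must occur at least 4 times; lines differing only in digits stay distinct.
rem_boilerplate(\@files, { min_dupl => 4, ignore_digits => 0 });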
New files are written, containing everything but the boilerplate text.
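The underlying idea can be illustrated with a minimal in-memory sketch. This is not the module's implementation: strip_repeated_lines is a hypothetical helper, documents are passed as array references of lines rather than file names, and a line is assumed to be counted once per document that contains it.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::Identify::BoilerPlate.
# $docs is a reference to a list of documents, each an array ref of lines.
# Returns new documents without the lines found in $min_dupl or more documents.
sub strip_repeated_lines {
    my ($docs, $min_dupl) = @_;

    # Count, for every distinct line, how many documents contain it
    # (assumption: repeats inside one document count only once).
    my %doc_count;
    for my $doc (@$docs) {
        my %seen;
        for my $line (@$doc) {
            $doc_count{$line}++ unless $seen{$line}++;
        }
    }

    # Keep only the lines that occur in fewer than $min_dupl documents.
    my @result;
    for my $doc (@$docs) {
        push @result, [ grep { $doc_count{$_} < $min_dupl } @$doc ];
    }
    return \@result;
}

my $docs = [
    [ 'ACME News', 'Story one',   'Copyright ACME' ],
    [ 'ACME News', 'Story two',   'Copyright ACME' ],
    [ 'ACME News', 'Story three', 'Copyright ACME' ],
];
my $clean = strip_repeated_lines($docs, 3);
print join(', ', @$_), "\n" for @$clean;
```

With a threshold of 3, the header and copyright lines are dropped from all three sample documents and only the story lines remain.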
rem_boilerplate() takes two arguments: a reference to a list of files
to be processed, and a reference to a hash of options.
The options are:
min_dupl - The minimum number of times a line has to occur to be considered boilerplate (default: 3). Can be either an integer or a percentage ('50%') of the number of files processed. Minimum value: 2.
ignore_digits - Lines that differ only in their digits will be considered duplicates (default: yes).
suffix - Suffix added to the names of the new files (default: 'content').
Only sets of consecutive duplicate lines at the start and end of documents are considered boilerplate (default: yes).
digest - Lines will be replaced by an MD5 digest during duplicate compilation, saving memory (default: no).
log - Name of the log file, where deleted lines are recorded; if set to false, no log will be created (default: './text-identify-boilerplate.log').
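How these options might behave can be sketched as follows. All three helpers (resolve_min_dupl, line_key, trim_edges) are hypothetical and not taken from the module's source. The sketch assumes that percentages are rounded up, that ignore_digits collapses each run of digits to a single '0', and that edge trimming strips leading and trailing runs of boilerplate lines; the module itself may do any of these differently.

```perl
use strict;
use warnings;
use POSIX qw(ceil);
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helpers, NOT part of Text::Identify::BoilerPlate.

# Turn min_dupl (an integer or a string such as '50%') into an absolute count.
# Assumption: percentages are rounded up. The documented minimum of 2 is enforced.
sub resolve_min_dupl {
    my ($min_dupl, $n_files) = @_;
    my $count = $min_dupl =~ /^(\d+(?:\.\d+)?)%$/
        ? ceil($1 / 100 * $n_files)
        : int($min_dupl);
    return $count < 2 ? 2 : $count;
}

# Build the key under which a line is counted.
# Assumption: ignore_digits replaces every run of digits with '0';
# digest stores a 32-character MD5 hex digest instead of the line itself.
sub line_key {
    my ($line, $opt) = @_;
    my $key = $line;
    $key =~ s/\d+/0/g if $opt->{ignore_digits};
    $key = md5_hex(encode_utf8($key)) if $opt->{digest};
    return $key;
}

# Remove boilerplate only at the edges of a document: strip the leading and
# trailing runs of lines for which $is_boiler->($line) returns true.
sub trim_edges {
    my ($lines, $is_boiler) = @_;
    my @kept = @$lines;
    shift @kept while @kept && $is_boiler->($kept[0]);
    pop @kept   while @kept && $is_boiler->($kept[-1]);
    return \@kept;
}

print resolve_min_dupl('50%', 10), "\n";                       # 5
print line_key('Page 3 of 12', { ignore_digits => 1 }), "\n";  # Page 0 of 0

my $body = trim_edges(
    [ 'Header', 'Body', 'Header', 'More', 'Footer' ],
    sub { $_[0] =~ /^(?:Header|Footer)$/ },
);
print join(', ', @$body), "\n";                                # Body, Header, More
```

Note that in the last call the second 'Header' survives because it sits in the middle of the document, which is the point of restricting removal to the edges.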
Lars Nygaard, <lars.nygaard@inl.uio.no>
The program needs extensive testing and tweaking before the simple algorithm can give consistently high-quality results.
Please report any bugs or feature requests to
bug-text-identify-boilerplate@rt.cpan.org, or through the web interface at
http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Identify-BoilerPlate.
I will be notified, and then you'll automatically be notified of progress on
your bug as I make changes.
Copyright 2005 Lars Nygaard, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.