| SWISH-Filter documentation | Contained in the SWISH-Filter distribution. |
SWISH::Filters::Doc2txt - Perl extension for filtering MSWord documents with Swish-e
This is a plug-in module that uses the "catdoc" program to convert MS Word documents to text for indexing by Swish-e. "catdoc" can be downloaded from:
http://www.ice.ru/~vitus/catdoc/ver-0.9.html
The program "catdoc" must be installed and in your PATH before running Swish-e.
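One way to confirm the PATH requirement before indexing is a small lookup script. The following is a minimal sketch, not part of the SWISH-Filter distribution: the helper name find_in_path is hypothetical, and it only checks that an executable file with the given name exists in one of the PATH directories.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of SWISH::Filter): return the full path of
# the first executable file named $prog found in the PATH directories,
# or nothing if there is none.
sub find_in_path {
    my ($prog) = @_;
    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $prog );
        return $candidate if -f $candidate && -x _;
    }
    return;
}

my $catdoc = find_in_path('catdoc');
print defined $catdoc
    ? "catdoc found at $catdoc\n"
    : "catdoc not found in PATH\n";
```

On Windows the executable normally carries an extension such as .exe, so the name passed in would need adjusting there.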
This filter does not specify input or output character encodings. This will change in the future to allow use of user_data to set the encoding.
A minor optimization during spidering (i.e. when docs are in memory instead of on disk) would be to use an open2() call to let catdoc read from stdin instead of from a file.
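The open2() idea could look roughly like the sketch below. It is an illustration under stated assumptions rather than the module's implementation: the helper name filter_through is hypothetical, and it assumes that catdoc reads its standard input when no file name is given. The document is written from a forked process so that a large document cannot deadlock against the command filling its output pipe.

```perl
use strict;
use warnings;
use IPC::Open2;
use POSIX ();

# Hypothetical helper (not part of SWISH::Filter): pipe in-memory content
# through an external command and return what the command wrote to stdout,
# or undef if the command exited with a non-zero status.
sub filter_through {
    my ( $content, @cmd ) = @_;

    my ( $from_cmd, $to_cmd );
    my $pid = open2( $from_cmd, $to_cmd, @cmd );

    # Feed the command from a separate process; the parent only reads.
    my $writer = fork();
    die "fork failed: $!" unless defined $writer;
    if ( $writer == 0 ) {
        close $from_cmd;
        binmode $to_cmd;
        print {$to_cmd} $content;
        close $to_cmd;
        POSIX::_exit(0);    # skip END blocks and destructors in the writer
    }

    close $to_cmd;
    binmode $from_cmd;
    my $text = do { local $/; <$from_cmd> };
    close $from_cmd;

    waitpid( $writer, 0 );
    waitpid( $pid,    0 );    # sets $? to the command's exit status
    return ( $? >> 8 ) == 0 ? $text : undef;
}

# Assumed usage: $word_bytes would hold the in-memory MS Word document.
# my $text = filter_through( $word_bytes, 'catdoc' );
```

The helper takes the command as a list, so no shell is involved and the same function can be exercised with any program that filters stdin to stdout.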
Bill Moseley
package SWISH::Filters::Doc2txt;
use strict;
use vars qw( $VERSION @ISA );
$VERSION = '0.15';
@ISA = ('SWISH::Filters::Base');

sub new {
    my ($class) = @_;
    my $self = bless {
        mimetypes => [qr!application/(x-)?msword!],    # list of types this filter handles
        priority  => 50,    # Make this a higher number (lower priority than the wvware filter)
    }, $class;

    # check for helpers
    return $self->set_programs('catdoc');
}

sub filter {
    my ( $self, $doc ) = @_;

    my $content = $self->run_catdoc( $doc->fetch_filename ) || return;

    # update the document's content type
    $doc->set_content_type('text/plain');

    # return the document
    return \$content;
}

1;
__END__