| SWISH-Filter documentation | Contained in the SWISH-Filter distribution. |
SWISH::Filter::Document - object model for result of SWISH::Filter
A SWISH::Filter::Document object is returned by the SWISH::Filter convert() method. This class is intended to be used privately, but you might subclass it in order to extend or modify its behaviour.
These methods are available to Filter authors, and also provide access to the document after calling the convert() method to end-users of SWISH::Filter.
End users of SWISH::Filter will use a subset of these methods. See User Methods.
Filter authors will also be interested in the Author Methods secion. The filter() method in each Filter module is passed a SWISH::Filter::Document object. Method calls may be made on this object to check the document's current content type, or to fetch the document as either a file name or a reference to a scalar containing the document content.
These methods are intended primarily for those folks using SWISH::Filter. If you are writing a filter, see also Author Methods.
Returns a scalar reference to the document. This can be used when the filter can operate on the document in memory (or if an external program expects the input to be from standard input).
If the file is currently on disk then it will be read into memory. If the file was stored in a temporary file on disk the file will be deleted once read into memory. The file will be read in binmode if $doc->is_binary is true.
Note that fetch_doc() is an alias.
Returns true if some filter processed the document
Fetches the current content type for the document.
Example:
return unless $filter->content_type =~ m!application/pdf!;
Returns a parser type based on the content type. Returns undef if no parser type is mapped.
Returns true if the document's content-type does not match "text/".
These methods are intended for those folks writing filters.
Returns a path to the document as stored on disk.
This name can be passed to external programs (e.g. catdoc) that expect input
as a file name.
If the document is currently in memory then a temporary file will be created. Do not expect the file name passed to be the real path of the document.
The file will be written in binmode if is_binary() returns true.
This method is not normally used by end-users of SWISH::Filter.
Processing will continue to the next filter if this is set to a true value. This should be set for filters that change encodings or uncompress documents.
Sets the content type for a document.
Fetches the name of the current file. This is useful for printing out the name of the file in an error message. This is the name passed in to the SWISH::Filter convert() method. It is optional and thus may not always be set.
my $name = $doc_object->name || 'Unknown name';
warn "File '$name': failed to convert -- file may be corrupt\n";
Fetches the the user_data passed in to the filter. This can be any data or data structure passed into SWISH::Filter new().
This is an easy way to pass special parameters into your filters.
Example:
my $data = $doc_object->user_data;
# see if a choice for the <title> was passed in
if ( ref $data eq 'HASH' && $data->{pdf2html}{title_field} {
...
...
}
Similar to user_data() but specifically intended for name/value pairs
in the meta tags in HTML or XML documents.
If set, either via new() or explicitly via the meta_data() method, the value of meta_data() can be used in a filter to set meta headers.
The value of meta_data() should be a hash ref so that it is easy to pass to SWISH::Filters::Base->format_meta_headers().
After a document is filtered, the meta_data() method can be used to retrieve the values that the filter inserted into the filtered document. This value is (again) a hash ref, and is set by the SWISH::Filter module if the filter() method returns a second value. Because the filter module might also extract meta data from the document itself, and might insert some of its own, it is up to the individual filter to determine how and what it handles meta data. See SWISH::Filters::Pdf2HTML for an example.
See the filter() method description in SWISH::Filter, the section on WRITING FILTERS.
Example:
my $doc = $filter->convert( meta_data => {foo => 'bar'} );
my $meta = $doc->meta_data;
# $meta *probably* is {foo => 'bar'} but that's up to how the filter handled
# the value passed in convert(). Could also be { foo => 'bar', title => 'some title' }
# for example.
Filters can be tested with the swish-filter-test program in the example/
directory. Run:
swish-filter-test -man
for documentation.
Please contact the Swish-e discussion list. http://swish-e.org
Bill Moseley
Currently maintained by Peter Karman perl@peknet.com
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
| SWISH-Filter documentation | Contained in the SWISH-Filter distribution. |
package SWISH::Filter::Document; use strict; use Carp; use File::Temp; use Symbol; use vars qw/ $VERSION $AUTOLOAD /; $VERSION = '0.15'; # Map content types to swish-e parsers. my %swish_parser_types = ( 'text/html' => 'HTML*', 'text/xml' => 'XML*', 'application/xml' => 'XML*', 'text/plain' => 'TXT*', );
# Returns a new SWISH::Filter::document object # or null if just can't process the document sub new { my ( $class, $doc, $content_type ) = @_; return unless $doc && $content_type; my $self = bless {}, $class; if ( ref $doc ) { unless ( length $$doc ) { warn "Empty document passed to filter\n"; return; } croak "Must supply a content type when passing in a reference to a document\n" unless $content_type; } else { unless ( -r $doc ) { warn "Filter unable to read doc '$doc': $!\n"; return; } } $self->set_content_type($content_type); $self->{cur_doc} = $doc; return $self; } # Clean up any temporary files sub DESTROY { my $self = shift; $self->remove_temp_file; } sub cur_doc { my ( $self, $doc_ref ) = @_; $self->{cur_doc} = $doc_ref if $doc_ref; return $self->{cur_doc}; } sub remove_temp_file { my $self = shift; unless ( $ENV{FILTER_DEBUG} ) { unlink delete $self->{temp_file} if $self->{temp_file}; } } # Used for tracking what filter(s) were used in processing sub filters_used { my $self = shift; return $self->{filters_used} || undef; } sub dump_filters_used { my $self = shift; my $used = $self->filters_used; local $SIG{__WARN__}; warn "\nFinal Content type for ", $self->name, " is ", $self->content_type, "\n"; unless ($used) { warn " *No filters were used\n"; return; } warn " >Filter $_->{name} converted from [$_->{start_content_type}] to [$_->{end_content_type}]\n" for @$used; }
sub fetch_doc_reference { my ($self) = @_; return ref $self->{cur_doc} # just $self->read_file should work ? $self->{cur_doc} : $self->read_file; } # here's an alias for fetching a document reference. *fetch_doc = *fetch_doc_reference;
sub was_filtered { my $self = shift; return $self->filters_used ? 1 : 0; }
sub content_type { return $_[0]->{content_type} || ''; }
sub swish_parser_type { my $self = shift; my $content_type = $self->content_type || return; for ( keys %swish_parser_types ) { return $swish_parser_types{$_} if $content_type =~ /^\Q$_/; } return; }
sub is_binary { my $self = shift; return $self->content_type !~ m[^text|xml$]; }
# This will create a temporary file if file is in memory sub fetch_filename { my ($self) = @_; return ref $self->{cur_doc} ? $self->create_temp_file : $self->{cur_doc}; }
sub set_continue { my ($self) = @_; return $self->continue(1); } sub continue { my ( $self, $continue ) = @_; my $old = $self->{continue} || 0; $self->{continue}++ if $continue; return $old; }
sub set_content_type { my ( $self, $type ) = @_; croak "Failed to pass in new content type\n" unless $type; $self->{content_type} = $type; } sub read_file { my $self = shift; my $doc = $self->{cur_doc}; return $doc if ref $doc; my $sym = gensym(); open( $sym, "<$doc" ) or croak "Failed to open file '$doc': $!"; binmode $sym if $self->is_binary; local $/ = undef; my $content = <$sym>; close $sym; $self->{cur_doc} = \$content; # Remove the temporary file, if one was created. $self->remove_temp_file; return $self->{cur_doc}; } # write file out to a temporary file sub create_temp_file { my $self = shift; my $doc = $self->{cur_doc}; return $doc unless ref $doc; my ( $fh, $file_name ) = File::Temp::tempfile(); # assume binmode if we need to filter... binmode $fh if $self->is_binary; print $fh $$doc or croak "Failed to write to '$file_name': $!"; close $fh or croak "Failed to close '$file_name' $!"; $self->{cur_doc} = $file_name; $self->{temp_file} = $file_name; return $file_name; }
sub AUTOLOAD { my ( $self, $newval ) = @_; no strict 'refs'; if ( $AUTOLOAD =~ /.*::set_(\w+)/ && $SWISH::Filter::extra_methods{$1} ) { my $attr_name = $1; *{$AUTOLOAD} = sub { $_[0]->{$attr_name} = $_[1]; return }; return $self->{$attr_name} = $newval; } elsif ( $AUTOLOAD =~ /.*::(\w+)/ && $SWISH::Filter::extra_methods{$1} ) { my $attr_name = $1; *{$AUTOLOAD} = sub { return $_[0]->{$attr_name} }; return $self->{$attr_name}; } croak "No such method: $AUTOLOAD\n"; } 1; __END__