| Net-ChooseFName documentation | view source | Contained in the Net-ChooseFName distribution. |
Net::ChooseFName - Perl extension for choosing a name of a local mirror of a net (e.g., FTP or HTTP) resource.
  use Net::ChooseFName;
  $namer = Net::ChooseFName->new(max_length => 64);   # Copies to CD ok

  $name = $namer->find_name_by_response($LWP_response);
  $name = $namer->find_name_by_response($LWP_response, $as_if_content_type);
  $name = $namer->find_name_by_url($url, $suggested_name,
                                   $content_type, $content_encoding);
  $name = $namer->find_name_by_url($url, $suggested_name, $content_type);
  $name = $namer->find_name_by_url($url, $suggested_name);
  $name = $namer->find_name_by_url($url);

  $namer_returns_undef = Net::ChooseFName->failer();  # Funny constructor
This module helps to pick a local file name for a remote resource
(e.g., one downloaded from the Internet). It turns out that this is a
tricky business; keep in mind that most servers are misconfigured,
most URLs are malformed, and most filesystems are limited
w.r.t. possible filenames. As a result most downloaders fail to work
in some situations since they choose names which are not supported on
particular filesystems, or not useful for file:///-related work.
Because of the many possible twists and ramifications, the design of
this module is to be as configurable as possible. One means of
configuration is a rich system of options which influence
different steps of the process. To cover cases when options are not
flexible enough, the process is broken into many steps; each step is
easily overridable by subclassing Net::ChooseFName.
The defaults are chosen to be as safe as possible without getting
in the way too much. For example, since % is a special
character on DOSish shells, to simplify working from the command line on
such systems, we avoid this character in generated file names.
Similarly, since MacOS has problems with filenames with 8-bit
characters, we avoid them too; since many Unix programs have problems
with spaces in file names, we massage them into underscores; the
length of the longest file path component is restricted to 255 chars.
Note that in many situations it is advisable to make these
restrictions yet stronger. For example, for copying to CD one should
restrict names yet more (max_length => 64); for copying to MSDOS
file systems enable option '8+3' => 1.
[In the description of methods the $self argument is omitted.]
Constructor method. Creates an object with the given options. Default values for the unspecified options are as follows (the comments list the methods in which each option is used):
  protect =>                            # protect_characters()
            # $1 should contain the match
     qr/([?*|\"<>\\:?#\x00-\x1F\x7F-\xFF\[\]])/,
  protect_pref => '@',                  # protect_characters(), protect_directory()
  root => '.',                          # find_directory()
  dir_mode => 0775,                     # directory_found()
  mkpath => 1,                          # directory_found()
  max_suff_len => 4,                    # split_suffix() 'jpeg'
  keepsuff_same_mediatype => 1,         # choose_suffix()
  type_suff =>                          # choose_suffix()
     {'text/ftp-dir-listing' => '.dirl'},
  keep_suff => { 'text/plain' => 1,
                 'application/octet-stream' => 1 },
  short_suffices =>                     # eight_plus_three()
     {jpeg => 'jpg', html => 'htm',
      'tar.bz2' => 'tbz', 'tar.gz' => 'tgz'},
  suggest_disposition => 1,             # find_name_by_response()
  suggested_only_basename => 1,         # find_name_by_response(), raw_name()
  fix_url_backslashes => 1,             # protect_characters()
  max_length => 255,                    # fix_dups(), fix_component()
  cache_name => 1,                      # name_found()
  queryless_types =>                    # url_takes_query()
     { map(($_ => 1), # http://filext.com/detaillist.php?extdetail=DJV 2005/01
           qw(image/djvu image/x-djvu image/dejavu image/x-dejavu
              image/djvw image/x.djvu image/vnd.djvu ))},
  queryless_ext => { 'djvu' => 1, 'djv' => 1 },   # url_takes_query()
The option type_suff is special in that the user-specified value is
added to this hash rather than replacing it. Similarly, the value
of option html_suff is used to populate the value for text/html
of this hash.
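The merging rule for type_suff and html_suff can be pictured with a short sketch in plain Perl. It does not call the module: the variable names (%default_type_suff, %user_opts, %type_suff) are hypothetical, and the real constructor may build the hash differently.

```perl
use strict;
use warnings;

# Built-in default, as listed above.
my %default_type_suff = ('text/ftp-dir-listing' => '.dirl');

# Hypothetical user-supplied options.
my %user_opts = (
    type_suff => { 'image/png' => '.png' },
    html_suff => 'htm',              # given without the dot
);

# User entries are added to the defaults instead of replacing them.
my %type_suff = (%default_type_suff, %{ $user_opts{type_suff} || {} });

# html_suff populates the text/html entry; the dot is prepended.
$type_suff{'text/html'} = ".$user_opts{html_suff}"
    if defined $user_opts{html_suff};

print "$_ => $type_suff{$_}\n" for sort keys %type_suff;
```

After the merge the hash holds three entries: the built-in one, the user's own, and the one derived from html_suff.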
Other options have undef as the default value. Their effects are
documented in the documentation of the methods they affect. With the
exception of known_names, these options are booleans.
  html_suff                     # new()
  known_names                   # known_names() name_found(); hash ref or undef
  only_known                    # known_names()
  hierarchical                  # raw_name(), find_directory()
  use_query                     # raw_name()
  8+3                           # fix_basename(), fix_component()
  keep_space                    # fix_component()
  keep_dots                     # fix_component()
  tolower                       # fix_component()
  dir_query                     # find_directory()
  site_dir                      # find_directory()
  ignore_existing_files         # fix_dups
  keep_nosuff, type_suff_no_enc, type_suff_fallback,
    type_suff_fallback_no_enc   # choose_suffix()
Summary of the options most useful in applications (with defaults if applicable):
  html_suff                     # Suffix for HTML (dot will be prepended)
  root => '.',                  # Where to put files?
  mkpath => 1,                  # Create directories with chosen names?
  max_length => 255,            # Maximal length of a path component
  ignore_existing_files         # Should the filename be "new"?
  cache_name => 1,              # Return the same filename on the same URL,
                                # even if file jumped to existence?
  hierarchical                  # Keep the directory components of URL path?
  suggested_only_basename => 1, # Should suggested name be relative to the path?
  use_query                     # Do not ignore the query part of URL?
                                # Value is used as (literal) prefix of query
  dir_query                     # Make the non-query part of URL a directory?
  site_dir                      # Put the hostname part of URL into directory?
  keepsuff_same_mediatype       # Preserve the file extensions matching type?
  8+3                           # Is the filesystem DOSish?
  keep_space                    # Map spaces in URL to spaces in filenames?
  tolower                       # Translate filenames to lowercase?
  type_suff, type_suff_no_enc, type_suff_fallback, type_suff_fallback_no_enc,
    keep_suff, keep_nosuff      # Hashes indexed by lowercased types;
                                # Allow tuning choosing the suffix
This method returns a suitable filename for the resource given its URL. Optional arguments are a suggested name (possibly, it will be modified according to options of the object), the content-type, and the content-encoding of the resource. If multiple content-encodings are required, specify them as an array reference.
A chain of helper methods ("Transformation chain") is called to
apply certain transformations to the name. undef is returned if
any of the helper methods (except known_names() and protect_query())
returns an undefined value; the caller is free to interpret this as "load
to memory", if appropriate. These helper methods are listed in the
following section.
This method returns a name given an LWP response object (and,
optionally, an overriding Content-Type). If option
suggest_disposition is TRUE, uses the header Content-Disposition
from the response as the suggested name, then passes the fields from
the response object to the method find_name_by_url().
This method returns $url modified by removing the parts related to access to parts of the resource. In particular, the fragment part is removed, as well as the query part if url_is_queryless() returns TRUE.
The method find_name_by_url() will return the return value of this
method (unless undef) immediately. Unless overridden, this method
returns the value of the hash option known_names indexed by the
$url. (By default this hash is empty.)
If the option only_known is true, it is a fatal error if $url is
not a key of this hash.
Returns the 0th approximation to the filename of the resource; the
return value has two parts: the principal part, and the query string
(undef if should not be used).
If $suggested is undefined, returns the path part of the $url (and the
query part, if present and if option use_query is TRUE). Otherwise
either returns $suggested, or (if options suggested_only_basename
and hierarchical are both true), returns the path part of the
$url with the last component changed to $suggested; the query part is
ignored in this case. In the latter case, if option suggested_basename is TRUE, only the last path component of $suggested is used.
Returns the filename $f with necessary character-by-character
translations performed. Unless overridden, it translates backslashes
to slashes if the option fix_url_backslashes is TRUE, replaces
characters matched by regular expression in the option protect by
their hexadecimal representation (with the leader being the value of
the option protect_pref), and replaces percent signs by the value
of the option protect_pref.
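The translation described above can be sketched in plain Perl as follows. This is an illustration rather than the module's code: the function name protect_characters_sketch is hypothetical, and the two-digit uppercase hexadecimal format is an assumption, since the text does not say how the hexadecimal representation is written.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the step described above.
sub protect_characters_sketch {
    my ($f, %opt) = @_;
    my $pref = defined $opt{protect_pref} ? $opt{protect_pref} : '@';
    my $re   = $opt{protect}
        || qr/([?*|\"<>\\:?#\x00-\x1F\x7F-\xFF\[\]])/;

    # Backslashes become slashes when fix_url_backslashes is set.
    $f =~ tr{\\}{/} if $opt{fix_url_backslashes};

    # Protected characters become prefix + hex code (format assumed).
    $f =~ s/$re/sprintf('%s%02X', $pref, ord($1))/ge;

    # Percent signs are replaced by the prefix itself.
    $f =~ s/%/$pref/g;
    return $f;
}

print protect_characters_sketch('dir\\file?x=1%20y',
                                fix_url_backslashes => 1), "\n";
# prints dir/file@3Fx=1@20y
```

With these rules a URL escape such as %20 ends up as @20, so the generated name contains no percent signs.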
Returns $query with necessary character-by-character translations
performed. Unless overridden, it translates slashes, backslashes, and
characters matched by the regular expression in the option protect by
their hexadecimal representation (with the leader being the value of
the option protect_pref), and replaces percent signs by the value
of the option protect_pref.
Returns a triple of the appropriate directory name, the relative filename, and a string to append to the filename, based on processed-so-far filename $f and the $query string.
Unless overridden, does the following: unless the option
hierarchical is TRUE, all but the last path component of $f are
ignored. If the option site_dir is TRUE, the host part of the URL
(as well as the port part, if non-standard) is prepended to the
filename. The leading backslash is always stripped, and the option
root is used as the lead components of the directory name. If
$query is defined, and the option dir_query is true, $f is used as
the last component of the directory, and $query as file name (with
option use_query prepended).
(Dirname is assumed to be /-terminated.)
Returns the provisional directory part of the filename. Unless
overridden, replaces empty components by the string empty preceded
by the value of protect_pref option; then applies the method
fix_component() to each component of the directory.
A callback to process the calculated directory name. Unless
overriden, it creates the directory (with permissions per option
dir_mode) if the option mkpath is TRUE.
Actually, the directory name is the return value, so this is the last chance to change the directory name...
Breaks the last component $f of the filename into a pair of basename and suffix, which are returned. $dirname consists of other components of the filename, $append is the string to append to the basename in the future.
Suffix may be empty, and is supposed to contain the leading dot (if
applicable); it may contain more than one dot. Unless overridden, the
suffix consists of all trailing non-empty started-by-dot groups with
length no more than given by the option max_suff_len (not including
the leading dot).
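The default splitting rule can be expressed as a single regular expression. The sketch below is plain Perl and does not call the module; the function name split_suffix_sketch is hypothetical, and corner cases (such as names starting with a dot) may be treated differently by the real method.

```perl
use strict;
use warnings;

# Hypothetical stand-in: the suffix is the longest run of trailing
# ".xxx" groups whose length (without the dot) is 1..$max_suff_len.
sub split_suffix_sketch {
    my ($f, $max_suff_len) = @_;
    $max_suff_len = 4 unless defined $max_suff_len;
    my ($base, $suff) =
        $f =~ /\A(.*?)((?:\.[^.]{1,$max_suff_len})*)\z/s;
    return ($base, $suff);
}

for my $name ('archive.tar.gz', 'page.html', 'report.final') {
    my ($base, $suff) = split_suffix_sketch($name);
    print "$name: basename '$base', suffix '$suff'\n";
}
```

With the default limit of 4, 'archive.tar.gz' splits into 'archive' and '.tar.gz', while 'report.final' keeps an empty suffix because 'final' has five characters.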
Returns a pair of basename and appropriate suffix for a file. $f is the basename of the file, $suff is its suffix, $dirname consists of other components of file names, $append is the string to append to the basename.
Different strategies applicable to this problem are:
Each of these has two variants: whether we want the encodings reflected in the suffix, or not. Unless overridden, choosing the strategy and variant consists of several rounds.
In the first round, choose user-specified suffix if $type is defined,
and is (lowercased) in the option-hashes type_suff and
type_suff_no_enc (choosing the variant based on which hash
matched). Keep the current suffix if $type is not defined, or option
keepsuff_same_mediatype is TRUE and the current suffix of the file
matches $type and $enc (per database of known types and encodings).
The second round runs if none of these was applicable. Choose
user-specified suffix if $type is (lowercased) in the hashes
type_suff_fallback or type_suff_fallback_no_enc (choosing
variant as above); keep the current suffix if the type (lowercased) is
in the hashes keep_nosuff or keep_suff (depending on whether
$suff is empty or not).
If none of these was applicable, the last round chooses the appropriate suffix by the database of known types and encodings; if not found, the existing suffix is preserved.
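The rounds described above can be sketched as follows. This is a simplified illustration in plain Perl, not the module's code: it ignores content-encodings and the *_no_enc variants, the function name choose_suffix_sketch is hypothetical, and %KNOWN_SUFF is a tiny made-up stand-in for the database of known types.

```perl
use strict;
use warnings;

# Hypothetical, tiny stand-in for the database of known types;
# the first suffix of each list is treated as the preferred one.
my %KNOWN_SUFF = (
    'text/html'  => ['.html', '.htm'],
    'image/jpeg' => ['.jpg', '.jpeg'],
);

sub choose_suffix_sketch {
    my ($suff, $type, %opt) = @_;
    return $suff unless defined $type;     # round 1: no type, keep suffix
    my $t     = lc $type;
    my @known = @{ $KNOWN_SUFF{$t} || [] };

    # Round 1: explicit user choice, or a suffix that already fits the type.
    return $opt{type_suff}{$t} if exists $opt{type_suff}{$t};
    return $suff
        if $opt{keepsuff_same_mediatype} and grep { $_ eq lc $suff } @known;

    # Round 2: fallback user choice, or types whose suffix is left alone.
    return $opt{type_suff_fallback}{$t}
        if exists $opt{type_suff_fallback}{$t};
    my $keep = length($suff) ? 'keep_suff' : 'keep_nosuff';
    return $suff if $opt{$keep} and $opt{$keep}{$t};

    # Round 3: consult the database; keep the existing suffix if unknown.
    return @known ? $known[0] : $suff;
}

print choose_suffix_sketch('.php', 'text/html',
                           keepsuff_same_mediatype => 1), "\n";
```

In this sketch a '.php' file served as text/html falls through to the last round and gets the preferred suffix from the stand-in database, while a '.htm' file of the same type would keep its suffix in the first round.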
Returns a pair of basename and suffix for a file. $f is the last
component of the name of the file, $dirname consists of other
components. Unless overridden, this method replaces an empty basename
by "index" and applies fix_component() method to the basename;
finally, if the '8+3' option is set, it converts the filename and suffix
to a name suitable for 8+3 filesystems.
Given a basename, extension, and the directory part of the filename, modifies the basename (if needed) to avoid duplicates; should return the complete file name (combining the dirname, basename, and suffix). Unless overriden, appends a number to the basename (shortening basename if needed) so that the result is unique.
This is a prime candidate for overriding (e.g., to ask user for confirmation of overwrite).
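A minimal sketch of the default behaviour, in plain Perl, might look like this. It is not the module's code: the function name fix_dups_sketch, the exists callback (used here so the example does not touch the disk), and the numbering scheme (1, 2, ... appended directly to the basename) are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in: append a number until the name is free,
# shortening the basename so the component fits into max_length.
sub fix_dups_sketch {
    my ($dir, $base, $suff, %opt) = @_;
    my $max    = $opt{max_length} || 255;
    my $exists = $opt{exists} || sub { -e $_[0] };

    my $name = "$dir$base$suff";
    return $name unless $exists->($name);

    for (my $n = 1; ; $n++) {
        my $room = $max - length($suff) - length($n);
        $room = 0 if $room < 0;
        $name = $dir . substr($base, 0, $room) . $n . $suff;
        return $name unless $exists->($name);
    }
}

# Pretend these two files are already present.
my %taken = map { $_ => 1 } ('mirror/index.html', 'mirror/index1.html');
print fix_dups_sketch('mirror/', 'index', '.html',
                      exists => sub { $taken{ $_[0] } }), "\n";
# prints mirror/index2.html
```

An overriding method could replace the loop with a prompt to the user, as suggested above.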
The callback method to register the found name. Unless overridden,
behaves as follows: if option cache_name is TRUE, stores the
found name in the known_names hash. Otherwise just returns the found name.
Returns a suitably modified value of a path component of a filename.
The non-overridden method unescapes embedded SPACE characters;
it removes leading and trailing spaces, and converts the rest to _ unless the
option keep_space is TRUE; removes trailing dots unless the option
keep_dots is TRUE; translates to lowercase if the option tolower
is TRUE, truncates to max_length if this option is set, and applies
the eight_plus_three() method if the option '8+3' is set.
Returns the value of filename modified for filesystems with 8+3 restriction on the filename (such as DOS). If $suffix is not given, calculates it from $fname; otherwise $suffix should include the leading dot, and $fname should have $suffix already removed. (Some parts of info may be moved between suffix and filename if judged appropriate.)
This method returns TRUE if the query part of the URL is selecting
a part of the resource (i.e., if it behaves as a fragment part,
and it is the client which should process this part). Such URLs are
detected by $type (should be in hash option queryless_types), or by
extension of the last path component (should be in hash option
queryless_ext).
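The detection rule can be sketched in plain Perl using the default values of the two options. The function name query_selects_part_sketch is hypothetical, and the way the extension is extracted from the URL is an assumption; the real method may parse the URL differently.

```perl
use strict;
use warnings;

# Defaults of the options queryless_types and queryless_ext.
my %queryless_types = map { $_ => 1 }
    qw(image/djvu image/x-djvu image/dejavu image/x-dejavu
       image/djvw image/x.djvu image/vnd.djvu);
my %queryless_ext = (djvu => 1, djv => 1);

# Hypothetical stand-in: TRUE if the query only selects a part of
# the resource, judged by content type or by file extension.
sub query_selects_part_sketch {
    my ($url, $type) = @_;
    return 1 if defined $type and $queryless_types{ lc($type) };
    (my $path = $url) =~ s/[?#].*//s;         # drop query and fragment
    my ($ext) = $path =~ m{\.([^./]+)\z};     # extension of last component
    return (defined $ext and $queryless_ext{ lc($ext) }) ? 1 : 0;
}

for my $url ('http://example.com/book.djvu?page=3',
             'http://example.com/search.cgi?q=djvu') {
    print "$url: ", query_selects_part_sketch($url), "\n";
}
```

For the first URL the query would be dropped together with the fragment when the name is computed; for the second it is an ordinary query.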
A class which behaves as Net::ChooseFName, but always returns
undef. For convenience, the constructor is duplicated as a class
method failer() in the class Net::ChooseFName.
None by default.
The documentation keeps mentioning "unless overridden"... Of course this is a generic remark applicable to any method of any class; however, please remember that the methods of this class are designed to be overridden.
There is no protection against a wanted directory name being already taken by a file.
There is no restriction on length of overall file name, only on length of a component name.
LWP=libwww-perl
Ilya Zakharevich <ilyaz@cpan.org>
Copyright (C) 2005 by Ilya Zakharevich <ilyaz@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.2 or, at your option, any later version of Perl 5 you may have available.