File::Info - Store file information persistently for fast lookup


File-Info documentation Contained in the File-Info distribution.

Index


Code Index:

NAME

Top

File::Info - Store file information persistently for fast lookup

SYNOPSIS

Top

  use File::Info qw( $PACKAGE $VERSION );

  my $info = File::Info->new($dir);
  # $fn is "basename"; contains no directory portion
  my $hex  = $info->md5hex($fn);  # Reads cached data if possible

DESCRIPTION

Top

This package stores per-file information for speedy lookup later. It is intended to store file info that takes a significant time to determine --- e.g., the MD5 sum of a large file, to avoid uneccessarily recalculation. This may be particularly helpful for searching across many files for some specific property.

File statistics are recalculated on demand. If the file size or modification time have changed since the calculations were last made, then they will be purged and recalculated.

File information is stored on a per-directory basis. Each file info file is stored in a directory; the files to which it refers are in the same directory, and are referred as names without paths.

CLASS CONSTANTS

Top

TYPE_CONSTANTS

As returned by the type|"type" method. These constants are exported by request, either individually, or together with the ':types' tag.

TYPE_UNKNOWN

File type not identified

TYPE_JPEG

A 'JPEG' image file.

TYPE_PAR

A 'par' (parity archive) file.

CLASS COMPONENTS

Top

CLASS HIGHER-LEVEL FUNCTIONS

Top

CLASS HIGHER-LEVEL PROCEDURES

Top

add_global_lookup

Add a lookup function to the. A method with the same name will be created, to provide the cached lookup.

ARGUMENTS

name

The name may consist only of letters, digits, and underscore characters. The first character must be a letter, and at least one digit or lower-case must be present.

builtin names will always be lower-case. If you stick to this, then you will need to make no change if your identifier should get absorbed into the core. On the other hand, if you use some upper-case letters (e.g., StudlyCaps), then you are assured that you will never clash will internal names.

These other names are reserved:

  add_local_lookup add_global_lookup isa import new dirname

code

The code to call to calculate the value. The code will be passed the absolute name of the file to lookup, and is expected to return a suitable value. The value will be cached.

INSTANCE CONSTRUCTION

Top

new

Create & return a new thing.

ARGUMENTS

_dirname

Name of the directory represented

INSTANCE COMPONENTS

Top

INSTANCE HIGHER-LEVEL FUNCTIONS

Top

dirname

The name of the directory to which this instance refers

STANDARD LOOKUPS

Top

Each of the following functions takes a filename (without path, relative to the directory of the instance), and returns the relevant value for the file.

Alternatively, they may be called as class methods, in which case the filename value must be absolute. This mode will never invoke a local method (see add_local_lookup|"add_local_lookup", and is less efficient if multiple lookups are made on files in the same directory.

md5_hex

The MD5 signature of the file, as 16 pairs of hex characters. The Digest::MD5 module (version 2 or above) is required to be present.

md5

The MD5 signature of the file, as a 16-byte binary value. The Digest::MD5 module (version 2 or above) is required to be present.

md5_16khex

The MD5 signature of the first 16k of the file, as 16 pairs of hex characters. The Digest::MD5 module (version 2 or above) is required to be present.

md5_16k

The MD5 signature of the first 16k of the file, file, as a 16-byte binary value. The Digest::MD5 module (version 2 or above) is required to be present.

line_count

The number of lines in the file. More acurrately, the number of "\n" characters in the file (as for wc). No attempt is made to guess the line terminator of the running system; for that would lead to inconsistent results on the same file on a (say) Samba-mounted drive accessed from both Windoze and UN*X.

type

The file type, as determined by reading the file itself. This is similar in intent to the file command under UN*X, with the following distinctions:

The returned value is a TYPE_x constant.

par_set_hash

Behaviour is defined only for files whose type|"type" is TYPE_PAR.

This is the hash used to identify par files that belong to a single set. It is a 16-byte binary file.

par_set_hash_hex

Behaviour is defined only for files whose type|"type" is TYPE_PAR.

As for par_set_hash|"par_set_hash", but a 16 pairs of hex characters representing the 16 bytes.

INSTANCE HIGHER-LEVEL PROCEDURES

Top

add_local_lookup

Add a lookup function to this instance only. A method with the same name will be created, to provide the cached lookup.

This method will only work on this instance. Any other instances with their own local methods will be respected. The local method will override any global method of the same name. However, using the class interface (e.g., File::Info->local($absname) will always invoke the global instance, if any (and fail, if not).

ARGUMENTS

name

The name may consist only of letters, digits, and underscore characters. The first character must be a letter, and at least one digit or lower-case must be present.

builtin names will always be lower-case. If you stick to this, then you will need to make no change if your identifier should get absorbed into the core. On the other hand, if you use some upper-case letters (e.g., StudlyCaps), then you are assured that you will never clash will internal names.

These other names are reserved:

  add_local_lookup add_global_lookup isa import new dirname

code

The code to call to calculate the value. The code will be passed the absolute name of the file to lookup, and is expected to return a suitable value. The value will be cached.

EXAMPLES

Top

BUGS

Top

REPORTING BUGS

Top

Email the author.

AUTHOR

Top

Martyn J. Pearce fluffy@cpan.org

COPYRIGHT

Top

SEE ALSO

Top


File-Info documentation Contained in the File-Info distribution.
# (X)Emacs mode: -*- cperl -*-

package File::Info;

# ----------------------------------------------------------------------------

# Pragmas -----------------------------

use 5.005_62;
use strict;
use warnings;
use warnings::register;

# Inheritance -------------------------

use constant _TYPE_NAMES => qw( TYPE_JPEG TYPE_PAR TYPE_UNKNOWN );
use base qw( Exporter );
our @EXPORT_OK = (qw( $PACKAGE $VERSION ), _TYPE_NAMES);
our %EXPORT_TAGS = (
                    types => [_TYPE_NAMES],
                   );

# Utility -----------------------------

use Fcntl                  1.03 qw( O_CREAT O_RDWR );
use File::Basename          2.6 qw( );
use File::Spec             0.82 qw( );
use MLDBM                  2.00 qw( DB_File Storable );
use Storable              1.014 qw( );

# ----------------------------------------------------------------------------

# CLASS METHODS --------------------------------------------------------------

# -------------------------------------
# CLASS CONSTANTS
# -------------------------------------

use constant TYPE_JPEG    => 'jpeg';
use constant TYPE_PAR     => 'par';
use constant TYPE_UNKNOWN => 'unknown';

# -------------------------------------

use constant MD5HEX_TEMPLATE => '%02x' x 16;
use constant MD5BIN_TEMPLATE => 'C16';
use constant FILENAME        => '.fileinfo';
use constant FORBIDDEN_NAMES => { map { $_ => 1 }
                                  qw( add_local_lookup add_global_lookup
                                      isa import new dirname ) };
use constant S16K            => 2 ** 14;

# -------------------------------------

our $PACKAGE = 'File-Info';
our $VERSION = '1.02';

# -------------------------------------
# CLASS CONSTRUCTION
# -------------------------------------

# -------------------------------------
# CLASS COMPONENTS
# -------------------------------------

my %local_names;

# -------------------------------------
# CLASS HIGHER-LEVEL FUNCTIONS
# -------------------------------------

my $digestive;
sub _MD5 {
  my ($fn) = @_;

  unless ( defined $digestive ) {
    eval "use Digest::MD5 2.00 qw( )";
    croak $@ if $@;
    $digestive = Digest::MD5->new;
  }
  open my $fh, '<', $fn
    or Carp::croak "Failed to open $fn: $!\n";
  my $md5 = $digestive->addfile($fh)->digest;
  close $fh
    or Carp::croak "Failed to close $fn after copying: $!\n";
  return $md5;
}

# -------------------------------------

sub _MD5_16K {
  my ($fn) = @_;

  unless ( defined $digestive ) {
    eval "use Digest::MD5 2.00 qw( )";
    croak $@ if $@;
    $digestive = Digest::MD5->new;
  }

  open my $fh, '<', $fn
    or Carp::croak "Failed to open $fn: $!\n";
  my $buffy = "\0" x S16K;
  read $fh, $buffy, S16K;
  my $md5 = $digestive->add($buffy)->digest;
  close $fh
    or Carp::croak "Failed to close $fn after copying: $!\n";
  return $md5;
}

# -------------------------------------

sub _LINE_COUNT {
  my ($fn) = @_;
  open my $fh, '<', $fn
    or Carp::croak "Failed to open $fn: $!\n";
  local $/ = \65536;
  my $count = 0;
  while (<$fh>) {
    $count += tr/\n/\n/;
  }
  close $fh
    or Carp::croak "Failed to close $fn after copying: $!\n";
  return $count;
}

# -------------------------------------

sub _TYPE {
  my ($fn) = @_;

  my $type = TYPE_UNKNOWN;

  open my $fh, '<', $fn
    or Carp::croak "Failed to open $fn: $!\n";
  my $buffy = "\0" x 8;
  read $fh, $buffy, 8;
  if ( $buffy eq "PAR\0\0\0\0\0" ) {
    $type = TYPE_PAR;
  } elsif ( unpack('n', substr($buffy, 0, 2)) eq 0xffd8 ) {
    $type = TYPE_JPEG;
  }
  close $fh
    or Carp::croak "Failed to close $fn after copying: $!\n";

  return $type;
}

# -------------------------------------

sub _PAR_SET_HASH {
  my ($fn) = @_;

  if ( defined $Archive::Par::VERSION ) {
    Carp::croak "Archive::Par version 1.52 or above required\n"
      unless $Archive::Par::VERSION >= 1.52;
  } else {
    eval "use Archive::Par 1.52 qw( )";
    croak $@ if $@;
  }

  my $par = Archive::Par->new($fn);
  return $par->set_hash;
}

# -------------------------------------
# CLASS HIGHER-LEVEL PROCEDURES
# -------------------------------------

# -------------------------------------

sub add_global_lookup {
  my $class = shift;
  my ($name, $code) = @_;

  Carp::croak
      "Lookup name $name must contain only alphanumerics and underscores\n"
    if $name =~ /[^A-Za-z0-9_]/;
  Carp::croak "Lookup name must begin with a letter\n"
    if $name !~ /^[A-Za-z]/;
  Carp::croak
      "Lookup name must contain at least one lowercase number or digit\n"
    unless $name =~ /[a-z0-9]/;

  Carp::croak "Name reserved: $name\n"
    if exists FORBIDDEN_NAMES->{$name};

  warnings::warnif("Name $name already in use; will supercede\n")
      if $class->can($name);

  my $subrname1 = join '::', $class, '_' . uc $name;
  my $subrname2 = join '::', $class, $name;
  {
    no strict 'refs';
    *{$subrname1} = $code;
  }
  $class->_make_class_ready($name, sub {$_[0]->_value($_[1], $name)});
}

# -------------------------------------

sub _make_class_ready {
  my $class = shift;
  my ($subrname, $subr) = @_;

  {
    no strict 'refs';
    if ( ! defined $subr ) {
      $subr = *{"${class}::${subrname}"}{CODE}
        or Carp::croak "Cannot locate subroutine $subrname in class $class\n";
    }

    no warnings 'redefine';

    my $file = __FILE__;
    my $warnhook = $SIG{__WARN__};
    my %redef_subr;

    local $SIG{__WARN__} = sub {
      # Nasty hack to avoid irritating mandatory redefine warnings bug
      return
        if ( ( $_[0] =~ /^Subroutine ([:\w]+) redefined at $file/ ) and 
             ( exists $redef_subr{$1} or 
               ( index($1,':') == -1 and exists $redef_subr{"main::$1"} )
             ) );
      my $message = join '', grep defined, @_;
      $warnhook->(@_)
        if defined $warnhook and UNIVERSAL::isa($warnhook, 'CODE');
    };

    $redef_subr{$subrname} = 1;
    *{"${class}::${subrname}"} =
      sub {
        if ( ref $_[0] ) {
          $_[0]->$subr(@_[1..$#_]);
        } else {
          Carp::croak
              (sprintf
               "File name (%s) must be absolute in class method call %s\n",
               $_[1], join '::', $class, $subrname)
            unless File::Spec->file_name_is_absolute($_[1]);
          my ($name, $path) = File::Basename::fileparse $_[1];
          $class->new($path)->$subr($name);
        }
      };
  }
}

# -------------------------------------
# CLASS UTILITY SUBROUTINES
# -------------------------------------

sub _md5hex {
  return sprintf MD5HEX_TEMPLATE, unpack MD5BIN_TEMPLATE, $_[0];
}

# INSTANCE METHODS -----------------------------------------------------------

# -------------------------------------
# INSTANCE CONSTRUCTION
# -------------------------------------

sub new {
  my $class = shift; $class = ref $class || $class;
  my ($dirname) = @_;

  tie(my %info, 'MLDBM',
      File::Spec->catfile($dirname, FILENAME), O_RDWR | O_CREAT, 0644);

  bless { _dirname      => File::Spec->rel2abs($dirname),
          _local_lookup => {},
          _info         => \%info,
        }, $class;
}

# -------------------------------------
# INSTANCE FINALIZATION
# -------------------------------------

sub DESTROY { untie %{$_[0]->{_info}}; }

# -------------------------------------
# INSTANCE COMPONENTS
# -------------------------------------

sub _value {
  my $self = shift; my $class = ref $self;
  my ($fn, $type) = @_;

  my ($vol, $dir, $file) = File::Spec->splitpath($fn);

  Carp::croak "Filename includes path: $fn\n"
    if length $vol or length $dir;

  my $abs_fn         = File::Spec->catfile($self->{_dirname}, $fn);
  my ($size, $mtime) = (stat($abs_fn))[7,9];

  my $info = exists $self->{_info}->{$fn} ? $self->{_info}->{$fn} : undef;
  if ( defined $info            and
       $info->{size}  eq $size  and
       $info->{mtime} eq $mtime ) {
    return $info->{$type}
      if exists $info->{$type};
  } else {
    $info = { size => $size, mtime => $mtime };
  }

  my $subr;
  if ( exists $self->{_local_lookup}->{$type} ) {
    $subr = $self->{_local_lookup}->{$type};
  } else {
    $subr = $class->can(join '', '_', uc $type)
      or Carp::croak(exists $local_names{$type}                          ?
                     "Value type $type not supported on this instance\n" :
                     "Unknown value type: $type\n"                       );
  }

  $info->{$type} = my $Result = $subr->($abs_fn);

  # Careful!  Adding to hashref is not enough to cause a persistent update
  # with MLDBM; we must re-set the value to the persistent hash.
  $self->{_info}->{$fn} = $info;

  return $Result;
}

# -------------------------------------
# INSTANCE HIGHER-LEVEL FUNCTIONS
# -------------------------------------

sub dirname { $_[0]->{_dirname} };

# -------------------------------------
# STANDARD LOOKUPS
# -------------------------------------

sub md5hex { _md5hex($_[0]->_value($_[1], 'md5')) }
sub md5_16khex { _md5hex($_[0]->_value($_[1], 'md5_16k')) }

sub par_set_hash_hex { _md5hex($_[0]->_value($_[1], 'par_set_hash')) }

# -------------------------------------
# INSTANCE HIGHER-LEVEL PROCEDURES
# -------------------------------------

sub add_local_lookup {
  my $self = shift; my $class = ref $self;
  my ($name, $code) = @_;

  Carp::croak "Lookup name $name must contain only alphanumerics and underscores\n"
    if $name =~ /[^A-Za-z0-9_]/;
  Carp::croak "Lookup name must begin with a letter\n"
    if $name !~ /^[A-Za-z]/;
  Carp::croak "Lookup name must contain at least one lowercase number or digit\n"
    unless $name =~ /[a-z0-9]/;

  Carp::croak "Name reserved: $name\n"
    if exists FORBIDDEN_NAMES->{$name};

  $self->{_local_lookup}->{$name} = $code;
  $local_names{$name} = 1;

  my $methname = join '::', $class, $name;
  {
    no strict 'refs';

    if ( ! defined *{$methname}{CODE} ) {
      *{$methname} =
        sub {
          my ($vol, $path) = File::Spec->splitpath($_[1]);
          Carp::croak("Local function $name may not be invoked " .
                      "with file with path: $_[1]\n")
            if length $vol or length $path;
          $_[0]->_value($_[1], $name);
        };
    }
  }
}

# ----------------------------------------------------------------------------

CHECK {
  no strict 'refs';
  # Ensure that every existing method (with a lowercase initial letter, not a
  # reserved name) handles class invocation
  __PACKAGE__->_make_class_ready($_)
    for grep(/^[a-z]/ && ! exists FORBIDDEN_NAMES->{$_},
             keys %{__PACKAGE__ . '::'});
  # If a method of the form _MD5 exists with no corresponding md5, fake up
  # the latter.
  for my $name (map(lc(substr $_,1),
                    grep( /^_(?!_)[A-Z0-9_]+$/            &&
                          ! exists FORBIDDEN_NAMES->{$_},
                          keys %{__PACKAGE__ . '::'}))) {
    __PACKAGE__->_make_class_ready($name, sub { $_[0]->_value($_[1], $name) });
  }
}

1; # keep require happy.

__END__