File::Locate::Iterator - read "locate" database with an iterator


File-Locate-Iterator documentation Contained in the File-Locate-Iterator distribution.

Index


Code Index:

NAME

Top

File::Locate::Iterator -- read "locate" database with an iterator

SYNOPSIS

Top

 use File::Locate::Iterator;
 my $it = File::Locate::Iterator->new;
 while (defined (my $entry = $it->next)) {
   print $entry,"\n";
 }

DESCRIPTION

Top

File::Locate::Iterator reads a "locate" database file in iterator style. Each next() call on the iterator returns the next entry from the database.

    /
    /bin
    /bin/bash
    /bin/cat

Locate databases normally hold filenames as a way of finding files by name faster than churning through all directories. Optional glob, suffix and regexp options on the iterator can restrict the entries returned.

See examples/native.pl in the File-Locate-Iterator sources for a simple sample read, or for a examples/mini-locate.pl whole locate program simulation.

Only "LOCATE02" format files are supported, per current versions of GNU locate, not the previous "slocate" format.

Iterators from this module are stand-alone, they don't need any of the various Perl iterator frameworks. But see Iterator::Locate, Iterator::Simple::Locate and MooseX::Iterator::Locate to inter-operate with those frameworks, with ways to grep, map and otherwise manipulate iterations.

Forks and Threads

If an iterator using a file handle is cloned to a new thread or a process level fork() then generally it can be used by the parent or the child but not both. The underlying file descriptor position is shared by parent and child, so when one of them reads it upsets the position for the other. This sort of thing affects almost all code working with file handles across fork and threads. Perhaps some thread CLONE code could let threads work correctly (if slower), but a fork is probably doomed.

Iterators using mmap work correctly for both forks and threads, except that the if_sensible size calculation and sharing is not thread-aware beyond the mmaps existing when the thread is spawned. File::Map knows the mmaps across all threads, but won't reveal them.

Taint Mode

In taint mode (see Taint Mode in perlsec) entries from a file or file handle are always tainted, the same as other file input. Taintedness of a database_str content string propagates to the entries.

For database_str_ref the initial taintedness propagates. In the unlikely event you untaint it during iteration the entries remain tainted because they depend or may depend on the data back from when the input was tainted. A rewind() will reset the taintedness though.

For reference, taint mode is only a small slowdown for the XS iterator code, and usually (it seems) only a little more for the pure perl.

Other Notes

The locate database format is only designed to be read forwards, hence no prev method on the iterator. The start of a previous record can't be distinguished by its content, and the "front coding" means the state at a given point may depend on records an arbitrary distance back too. A "tell" which gave file position plus state would be possible, though perhaps a "clone" of the whole iterator would be more use.

On some systems mmap may be a bit too effective, giving a process more of the CPU than other processes which make periodic read system calls. This is a matter of OS scheduling, but you might have to turn down the nice or ionice if doing a lot of mmapped work (see nice(1), ionice(1), setpriority in perlfunc, ioprio_set(2)).

FUNCTIONS

Top

Constructor

$it = File::Locate::Iterator->new (key=>value,...)

Create and return a new locate database iterator object. The following optional key/value pairs can be given,

database_file (string, default the system locate database)
database_fh (handle ref)

The file to read, either as filename or file handle. The default file is the default_database_file() below.

    $it = File::Locate::Iterator->new
            (database_file => '/foo/bar.db');

A filehandle is read with the usual PerlIO so it can use layers and come from various sources, but it should be in binary mode (see binmode in perlfunc and :raw in perlio).

database_str (string)
database_str_ref (ref to string)

The database contents to read in the form of a byte string.

    $it = File::Locate::Iterator->new
      (database_str => "\0LOCATE02\0\0/hello\0\006/world\0");

A database_str ends up copied into the iterator. database_str_ref is a scalar ref to the string and is not copied.

    my $str = "\0LOCATE02\0\0/hello\0\006/world\0";
    $it = File::Locate::Iterator->new
      (database_str_ref => \$str);

suffix (string)
suffixes (arrayref of strings)
glob (string)
globs (arrayref of strings)
regexp (string or regexp object)
regexps (arrayref of strings or regexp objects)

Restrict the entries returned to those with given suffix(es) or matching the given glob(s) or regexp(s). For example,

    # C code files on the system, .c and .h
    $it = File::Locate::Iterator->new
            (suffixes => ['.c','.h']);

If multiple patterns or suffixes are given then matches of any are returned.

Globs are in the style of the locate program which means fnmatch with no options (see File::FnMatch) and the pattern match of the full entry if there's wildcards ("*", "?" or "[") or of any part if a fixed string.

    glob => '*.c'  # .c files, no .cxx files
    glob => '.c'   # fixed str, .cxx matches too

Globs should be byte strings (not wide chars) since that's how the database entries are handled, and suspect fnmatch has no notion of charset coding for its strings and patterns.

use_mmap (string, default "if_sensible")

Whether to use mmap to access the database. This is fast and resource-efficient when available. To use mmap you must have the File::Map module, the file must fit in available address space, and for a database_fh handle there mustn't be any transforming PerlIO layers. The use_mmap choices are

    undef           \
    "default"       | use mmap if sensible
    "if_sensible"   /
    "if_possible"   use mmap if possible, otherwise file I/O
    0               don't use mmap
    1               must use mmap, croak if cannot




Setting default, undef or omitted means if_sensible. if_sensible uses mmap if available, and the file size is reasonable, and for database_fh if it isn't already using an :mmap layer. if_possible uses mmap whenever it can be done, without those qualifiers.

    $it = File::Locate::Iterator->new
            (use_mmap => 'if_possible');

When multiple iterators access the same file they share the mmap. The size check for if_sensible counts space in all File::Locate::Iterator mappings and won't go beyond 1/5 of available data space, which is assumed to be a quarter of the wordsize, so for a 32-bit system a total at most 200Mb. if_possible and if_sensible will only mmap ordinary files because generally the file size on char specials is not reliable.

$filename = File::Locate::Iterator->default_database_file()

Return the default database file used for new above. This is meant to be the same as the locate program uses and currently means

    $ENV{'LOCATE_PATH'}            if that env var set
    /var/cache/locate/locatedb     otherwise

Perhaps in the future it might be possible to check how findutils has been installed rather than assuming /var/cache/locate/.

Operations

$entry = $it->next

Return the next entry from the database, or no values at end of file. No values means undef in scalar context or an empty list in array context so you can loop with either

    while (defined (my $filename = $it->next)) ...

or

    while (my ($filename) = $it->next) ...

The return is a byte string since it's normally a filename and Perl handles filenames as byte strings.

$it->rewind

Rewind $it back to the start of the database. The next $it->next call will return the first entry.

This is only possible when the underlying database file or handle is seekable, ie. seek() works, which will usually mean a plain file, or PerlIO layers with seek support.

ENVIRONMENT VARIABLES

Top

LOCATE_PATH

Default locate database.

FILES

Top

/var/cache/locate/locatedb

Default locate database, if LOCATE_PATH environment variable not set.

OTHER WAYS TO DO IT

Top

File::Locate reads a locate database with callbacks instead. Whether you want callbacks or an iterator is a matter of personal preference. Iterators let you write your own loop and have multiple searches in progress simultaneously.

The speed of an iterator is about the same as callbacks when File::Locate::Iterator is built with its XSUB code (which requires Perl 5.10.0 or higher currently).

Iterators are good for cooperative coroutining like POE or Gtk where state must be held in some sort of variable to be progressed by calls from the main loop. Note that next() blocks on reading from the database, so the database should generally be a plain file rather than a socket or something, so as not to hold up a main loop.

If you have the recommended File::Map module then iterators share an mmap of the database file. Otherwise currently each holds a separate open handle to the database which means a file descriptor and PerlIO buffering per iterator. Sharing a handle and making each seek to its desired position would be possible, but a seek drops buffered data and so would go slower. Some PerlIO or IO::Handle trickery might transparently share an fd and keep buffered blocks from multiple file positions.

SEE ALSO

Top

Iterator::Locate, Iterator::Simple::Locate, MooseX::Iterator::Locate

File::Locate, locate(1) and the GNU Findutils manual, File::FnMatch, File::Map

HOME PAGE

Top

http://user42.tuxfamily.org/file-locate-iterator/index.html

COPYRIGHT

Top


File-Locate-Iterator documentation Contained in the File-Locate-Iterator distribution.

# Copyright 2009, 2010, 2011 Kevin Ryde.
#
# This file is part of File-Locate-Iterator.
#
# File-Locate-Iterator is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as published
# by the Free Software Foundation; either version 3, or (at your option)
# any later version.
#
# File-Locate-Iterator is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
# Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with File-Locate-Iterator; see the file COPYING.  Failing that, go to
# <http://www.gnu.org/licenses/>.


package File::Locate::Iterator;
use 5.006;  # for qr//, and open anon handles
use strict;
use warnings;
use Carp;

use DynaLoader;
our @ISA = ('DynaLoader');

our $VERSION = 20;

# uncomment this to run the ### lines
#use Devel::Comments;


if (eval { __PACKAGE__->bootstrap($VERSION) }) {
  ### FLI next() from XS
} else {
  ### FLI next() in perl, XS didn't load: $@

  eval "#line ".(__LINE__+1)." \"".__FILE__."\"\n" . <<'HERE' or die;
use strict;
use warnings;
use File::FnMatch;

sub _UNEXPECTED_EOF {
  my ($self) = @_;
  undef $self->{'entry'};
  croak 'Invalid database contents (unexpected EOF)';
}
sub _ERROR_READING {
  my ($self) = @_;
  undef $self->{'entry'};
  croak 'Error reading database: ',$!;
}
sub _BAD_SHARE {
  my ($self, $sharelen) = @_;
  undef $self->{'entry'};
  croak "Invalid database contents (bad share length $sharelen)";
}
sub next {
  my ($self) = @_;
  ### FLI PP next()

  my $sharelen = $self->{'sharelen'};
  my $entry = $self->{'entry'};
  my $regexp = $self->{'regexp'};
  my $globs = $self->{'globs'};

  if (my $mref = $self->{'mref'}) {
    my $pos = $self->{'pos'};
  MREF_LOOP: for (;;) {
      #### pos in map: sprintf('%#x', $pos)
      if ($pos >= length ($$mref)) {
        undef $self->{'entry'};
        return; # end of file
      }

      my ($adjshare) = unpack 'c', substr ($$mref, $pos++, 1);
      if ($adjshare == -128) {
        #### 2byte pos: sprintf('%#X', $pos)
        # print ord(substr ($$mref,$pos,1)),"\n";
        # print ord(substr ($$mref,$pos+1,1)),"\n";

        if ($pos+2 > length ($$mref)) { goto &_UNEXPECTED_EOF; }

        # for perl 5.10 up could use 's>' for signed 16-bit big-endian,
        # instead of getting unsigned and stepping down
        ($adjshare) = unpack 'n', substr ($$mref, $pos, 2);
        if ($adjshare >= 32768) { $adjshare -= 65536; }

        $pos += 2;
      }
      ### $adjshare
      $sharelen += $adjshare;
      # print "share now $sharelen\n";
      if ($sharelen < 0 || $sharelen > length($entry)) {
        push @_, $sharelen; goto &_BAD_SHARE;
      }

      my $end = index ($$mref, "\0", $pos);
      # print "$pos to $end\n";
      if ($end < 0) { goto &_UNEXPECTED_EOF; }

      $entry = (substr($entry,0,$sharelen)
                . substr ($$mref, $pos, $end-$pos));
      $pos = $end + 1;

      if ($regexp) {
        last if $entry =~ $regexp;
      } elsif (! $globs) {
        last;
      }
      if ($globs) {
        foreach my $glob (@$globs) {
          last MREF_LOOP if File::FnMatch::fnmatch($glob,$entry)
        }
      }
    }
    $self->{'pos'} = $pos;

  } else {
    local $/ = "\0"; # readline() to \0

    my $fh = $self->{'fh'};
    ### pos tell(fh): sprintf('%#x',tell($fh))
  IO_LOOP: for (;;) {
      my $adjshare;
      unless (my $got = read $fh, $adjshare, 1) {
        if (defined $got) {
          undef $self->{'entry'};
          return; # EOF
        }
        goto &_ERROR_READING;
      }

      ($adjshare) = unpack 'c', $adjshare;
      if ($adjshare == -128) {
        my $got = read $fh, $adjshare, 2;
        if (! defined $got) { goto &_ERROR_READING; }
        if ($got != 2) { goto &_UNEXPECTED_EOF; }

        # for perl 5.10 up could use 's>' for signed 16-bit big-endian
        # pack, instead of getting unsigned and stepping down
        ($adjshare) = unpack 'n', $adjshare;
        if ($adjshare >= 32768) { $adjshare -= 65536; }
      }
      ### $adjshare

      $sharelen += $adjshare;
      ### share now: $sharelen
      if ($sharelen < 0 || $sharelen > length($entry)) {
        push @_, $sharelen; goto &_BAD_SHARE;
      }

      my $part;
      {
        # perlfunc.pod of 5.10.0 for readline() says you can clear $!
        # then check it afterwards for an error indication, but that's
        # wrong, $! ends up set to EBADF when filling the PerlIO buffer,
        # which means if the readline crosses a 1024 byte boundary
        # (something in attempting a fast gets then falling back ...)

        $part = readline $fh;
        if (! defined $part) { goto &_UNEXPECTED_EOF; }

        ### part: $part
        chomp $part or goto &_UNEXPECTED_EOF;
      }

      $entry = substr($entry,0,$sharelen) . $part;

      if ($regexp) {
        last if $entry =~ $regexp;
      } elsif (! $globs) {
        last;
      }
      if ($globs) {
        foreach my $glob (@$globs) {
          last IO_LOOP if File::FnMatch::fnmatch($glob,$entry)
        }
      }
    }
  }

  $self->{'sharelen'} = $sharelen;
  return ($self->{'entry'} = $entry);
}

1;

HERE
}

use constant default_use_mmap => 'if_sensible';
my $header = "\0LOCATE02\0";


# Default path these days is /var/cache/locate/locatedb.
#
# Back in findutils 4.1 it was $(localstatedir)/locatedb, but there seems to
# have been no way to ask about the location.
#
sub default_database_file {
  # my ($class) = @_;
  if (defined (my $env = $ENV{'LOCATE_PATH'})) {
    return $env;
  } else {
    return '/var/cache/locate/locatedb';
  }
}

# The fields, all meant to be private, are:
#
# regexp
#     qr// regexp of all the 'regexp', 'regexps', 'suffix' and 'suffixes'
#     parameters.  If no such matches then no such field.  When the field
#     exists an entry must match the regexp or is skipped.
#
# globs
#     arrayref of strings which are globs to fnmatch().  If no globs then no
#     such field.  When the field exists an entry must match at least one of
#     the globs.
#
# mref
#     Ref to a scalar which holds the database contents, or undef if using
#     fh instead.  It's either a ref to the 'database_str' parameter passed
#     in, or a ref to a scalar created as an mmap of the file.  The mmap one
#     is shared among iterators through the File::Locate::Iterator::FileMap
#     caching.
#
# fh
#     When mref is undef, ref file handle which is to be read from,
#     otherwise no such field.  This can be either the 'database_fh'
#     parameter or an opened anonymous handle of the 'database_file'
#     parameter.
#
#     When mmap is used the 'database_fh' is not held here.  The mmap is
#     made (or rather, looked up in the FileMap cache), and the handle is
#     then no longer needed and can be closed or garbage collected in the
#     caller.
#
# fh_start
#     When fh is set, the tell($fh) position just after the $header in that
#     fh.  This is where to seek() back to for a $it->rewind.  If tell()
#     failed then this is -1 and $it->rewind is not possible.
#
#     Normally fh_start is simply length($header) for a database starting at
#     the start of the file, but a database_fh arg which is positioned at
#     some offset into a file can be read and remembering an fh_start
#     position lets $it->rewind work on it too.
#
# fm
#     When using mmap, a File::Locate::Iterator::FileMap object which is the
#     cache entry for the database file, otherwise no such field.  This is
#     hung onto to keep it alive while in use.  $self->{'mref'} is
#     $fm->mmapref in this case.
#
# pos
#     When mref is not undef, an integer offset into the $$mref string which
#     is the current read position.  The file header is checked in new() so
#     the initial value is length($header), ie. 10, the position of the
#     first entry (or possibly EOF).
#
# entry
#     String of the last database entry returned, or no such field before
#     the first is read, or undef after EOF is hit.  Might be undef instead
#     of not existing if a hypothetical seek() goes back to the start of the
#     file.
#
# sharelen
#     Integer which is the number of leading bytes of 'entry' which the next
#     entry will share with that previous entry.  Initially 0.
#
#     This is modified successively by the "adjshare" of each entry as each
#     takes more or less of the preceding entry.  An adjshare can range from
#     -sharelen to take nothing at all of the previous entry, up to
#     length($entry)-sharelen to increment up to take all of the previous
#     entry.
#
sub new {
  my ($class, %options) = @_;
  ### FLI new(): %options

  # delete 'regexp' field if it's undef, as the XS code wants no 'regexp'
  # field for no regexps, not a field set to undef
  my @regexps;
  if (defined (my $regexp = delete $options{'regexp'})) {
    push @regexps, $regexp;
  }
  if (my $regexps = delete $options{'regexps'}) {
    push @regexps, @$regexps;
  }
  foreach my $suffix (defined $options{'suffix'} ? $options{'suffix'} : (),
                      @{$options{'suffixes'}}) {
    push @regexps, quotemeta($suffix) . '$';
  }
  ### @regexps

  # as per findutils locate.c locate() function, pattern with * ? or [ is a
  # glob, anything else is a literal match
  #
  my @globs = (defined $options{'glob'} ? $options{'glob'} : (),
               @{$options{'globs'} || []});
  @globs = grep { ($_ =~ /[[*?]/
                   || do { push @regexps, quotemeta($_); 0 })
                } @globs;
  ### @globs

  my $self = bless { entry    => '',
                     sharelen => 0,
                   }, $class;

  if (@regexps) {
    my $regexp = join ('|', @regexps);
    $self->{'regexp'} = qr/$regexp/s;
  }
  if (@globs) {
    $self->{'globs'} = \@globs;
  }

  ### regexp: $self->{'regexp'}
  ### globs : $self->{'globs'}

  if (defined (my $ref = $options{'database_str_ref'})) {
    $self->{'mref'} = $ref;

  } elsif (defined $options{'database_str'}) {
    $self->{'mref'} = \$options{'database_str'};

  } else {
    my $use_mmap = (defined $options{'use_mmap'}
                    ? $options{'use_mmap'}
                    : $class->default_use_mmap);
    ### $use_mmap
    if ($use_mmap) {
      if (! eval { require File::Locate::Iterator::FileMap }) {
        ### FileMap not possible: $@
        $use_mmap = 0;
      }
    }

    my $fh = $options{'database_fh'};
    if (defined $fh) {
      if ($use_mmap eq 'if_sensible'
          && File::Locate::Iterator::FileMap::_have_mmap_layer($fh)) {
        ### already have mmap layer, not sensible to mmap again
        $use_mmap = 0;
      }
    } else {
      my $file = (defined $options{'database_file'}
                  ? $options{'database_file'}
                  : $class->default_database_file);
      ### open database_file: $file

      # Crib note: '<:raw' means without :perlio buffering, whereas
      # binmode() preserves that buffering, assuming it's in the $ENV{'PERLIO'}
      # defaults.  Also :raw is not available in perl 5.6.
      open $fh, '<', $file
        or croak "Cannot open $file: $!";
      binmode($fh)
        or croak "Cannot set binary mode";
    }

    if ($use_mmap eq 'if_sensible') {
      $use_mmap = (File::Locate::Iterator::FileMap::_mmap_size_excessive($fh)
                   ? 0
                   : 'if_possible');
      ### if_sensible after size check becomes: $use_mmap
    }

    if ($use_mmap) {
      ### attempt mmap: $fh, (-s $fh)

      # There's many ways an mmap can fail, just chuck an eval on FileMap /
      # File::Map it to catch them all.
      # - An ordinary readable file of length zero may fail per POSIX, and
      #   that's how it is in linux kernel post 2.6.12.  However File::Map
      #   0.20 takes care of returning an empty string for that.
      # - A char special usually gives 0 for its length, even for instance
      #   linux kernel special files like /proc/meminfo.  Char specials can
      #   often be mapped perfectly well, but without a length don't know
      #   how much to look at.  For that reason if_possible restricts to
      #   ordinary files, though forced use_mmap=>1 just goes ahead anyway.
      #
      if ($use_mmap eq 'if_possible') {
        if (! -f $fh) {
          ### if_possible, not a plain file, consider not mmappable
        } else {
          if (! eval { $self->{'fm'}
                         = File::Locate::Iterator::FileMap->get($fh) }) {
            ### mmap failed: $@
          }
        }
      } else {
        $self->{'fm'} = File::Locate::Iterator::FileMap->get($fh);
      }
    }
    if ($self->{'fm'}) {
      $self->{'mref'} = $self->{'fm'}->mmap_ref;
    } else {
      $self->{'fh'} = $fh;
    }
  }

  if (my $mref = $self->{'mref'}) {
    unless ($$mref =~ /^\Q$header/o) { goto &_ERROR_BAD_HEADER }
    $self->{'pos'} = length($header);
  } else {
    my $got = '';
    read $self->{'fh'}, $got, length($header);
    if ($got ne $header) { goto &_ERROR_BAD_HEADER }
    $self->{'fh_start'} = tell $self->{'fh'};
  }

  return $self;
}
sub _ERROR_BAD_HEADER {
  croak 'Invalid database contents (no LOCATE02 header)';
}

sub rewind {
  my ($self) = @_;

  $self->{'sharelen'} = 0;
  $self->{'entry'} = '';
  if ($self->{'mref'}) {
    $self->{'pos'} = length($header);
  } else {
    $self->{'fh_start'} > 0
      or croak "Cannot seek database";
    seek ($self->{'fh'}, $self->{'fh_start'}, 0)
      or croak "Cannot seek database: $!";
  }
}

# return true if mmap is in use
# (an actual mmap, not the slightly similar 'database_str' option)
# this is meant for internal use as a diagnostic ...
sub _using_mmap {
  my ($self) = @_;
  return defined $self->{'fm'};
}

# Not yet documented, likely worthwhile as long as it works properly.
# Return empty list for nothing yet?  Same as next().
# Return empty list at EOF?  At EOF 'entry' is undefed out.
#
# =item C<< $entry = $it->current >>
#
# Return the current entry from the database, meaning the same as the last
# call to C<next> returned.  At the start of the database (before the first
# C<next>) or at end of the database the return is an empty list.
#
#     while (defined $it->next) {
#         ...
#         print $it->current,"\n";
#     }
#
sub _current {
  my ($self) = @_;
  if (defined $self->{'entry'}) {
    return $self->{'entry'};
  } else {
    return;
  }
}


1;
__END__