Search::Glimpse::Index - Interface to glimpseindex


Search-Glimpse documentation Contained in the Search-Glimpse distribution.

NAME

Search::Glimpse::Index - Interface to glimpseindex

SYNOPSIS

  use Search::Glimpse::Index;

  my %opt = (
      timeindex   => 1,
      dryrun      => 0,
      indexall    => 0,
      indexnum    => 0,
      incremental => 0,
      structural  => 0,
      destdir     => "$ENV{HOME}/myindexes",
      swsize      => 90,     # words in at least 90% of files become stop words
  );
  my $indexer = Search::Glimpse::Index->new( %opt );

  $indexer->index("/path/to/folder/to/index");

DESCRIPTION

This module is a Perl interface to the glimpseindex binary. It is meant to make the application easier to use from within Perl scripts or modules.

Available Methods

new

The constructor receives a hash with the indexing options to use. All of these values have sensible defaults (mostly the glimpseindex defaults). Each option is described briefly below, but reading the complete glimpseindex manpage is recommended.

Known options are:

destdir (glimpseindex -H option)

This is the folder where glimpseindex will store its index files. This is also the path where you should put your exclude/include files. Future versions of this module might include an interface for those files.

dryrun (glimpseindex -I option)

This option is a boolean value. When true, glimpseindex does not index anything and only outputs the files that would be indexed in a real run.

bigindex (glimpseindex -b option)

glimpseindex has three different index sizes. By default the medium index is used (glimpseindex -o). Use this option for bigger indexes and (hopefully) faster results.

smallindex

glimpseindex has three different index sizes. By default the medium index is used (glimpseindex -o). Use this option for smaller indexes (not using any glimpseindex switch).

indexnum (glimpseindex -n option)

By default, tokens with digits are not indexed. Therefore, things like abc123 or a date will not be indexed. Use this option to force tokens with digits to be indexed.

indexall (glimpseindex -E option)

Makes glimpseindex index all files, regardless of their file type. Note that glimpseindex will still honor .glimpse_exclude files.

timeindex (glimpseindex -t option)

This option is only available for glimpse version 3.5 or newer. It changes the order in which files are indexed. By default, files are indexed in a mostly arbitrary order. With this option (which doesn't work in smallindex mode), the index stores files in reverse order of modification time (most recent files first). Query results are therefore returned in that order, and glimpse is able to filter results by age.

incremental (glimpseindex -f option)

Useful if you have already run glimpseindex and need to reindex. This option performs an incremental indexing. If there is no current index, or if this procedure fails, glimpseindex automatically reverts to the default mode (which is to index everything from scratch).

structural (glimpseindex -s option)

Use this option if you want to support structured queries.

swsize (glimpseindex -S option)

This option controls the number of stop words to be considered. For further details on how the values of this option behave, please check the glimpseindex manpage.
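
As a rough illustration of how the options above correspond to glimpseindex switches, the following sketch builds a switch list from an option hash. The helper name glimpse_switches is hypothetical and is not part of this module; the mapping simply follows the option/switch pairs listed above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Search::Glimpse::Index): turn the
# documented options into a list of glimpseindex switches.
sub glimpse_switches {
    my %opt = @_;
    my @switches;

    # Boolean options and the switch each one enables.
    my @flags = (
        [ timeindex   => '-t' ],
        [ dryrun      => '-I' ],
        [ indexall    => '-E' ],
        [ indexnum    => '-n' ],
        [ incremental => '-f' ],
        [ structural  => '-s' ],
    );
    for my $flag (@flags) {
        my ($name, $switch) = @$flag;
        push @switches, $switch if $opt{$name};
    }

    # Index size: no switch for smallindex, -b for bigindex, -o otherwise.
    unless ($opt{smallindex}) {
        push @switches, $opt{bigindex} ? '-b' : '-o';
    }

    # Options that take a value.
    push @switches, '-S', $opt{swsize}  if $opt{swsize};
    push @switches, '-H', $opt{destdir} if $opt{destdir};

    return @switches;
}

print join(' ', glimpse_switches(timeindex => 1, swsize => 90, destdir => '/tmp/idx')), "\n";
# prints: -t -o -S 90 -H /tmp/idx
```

Returning a list rather than one joined string keeps each argument separate, which makes it easier to pass the switches to a command without shell quoting problems.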

index

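Call this method with the path to be indexed. It runs glimpseindex on that path with the options given to the constructor and returns the object itself. The text printed by glimpseindex is kept in the object's output field.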
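
The pattern behind this method is to run an external program and keep whatever it prints. A minimal sketch of that pattern follows; it is not the module's own code. The helper name run_and_capture is hypothetical, and the running Perl interpreter ($^X) stands in for glimpseindex so that the example works without glimpse installed. The list form of open is an assumption of this sketch, chosen so that the path is never interpreted by a shell.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return everything it prints.
sub run_and_capture {
    my @command = @_;

    # Use the C locale only for the duration of this call.
    local $ENV{LC_ALL} = 'C';

    # List-form pipe open: arguments are passed as-is, without a shell.
    open my $pipe, '-|', @command
        or die "Can't execute $command[0]: $!";
    my $output = join '', <$pipe>;
    close $pipe;

    return $output;
}

# Stand-in for a dry run such as: glimpseindex -I -H $destdir $path
my $output = run_and_capture(
    $^X, '-e', 'print "would index: $ARGV[0]\n"', '/path/to/folder',
);
print $output;
```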

SEE ALSO

perl(1)

AUTHOR

Alberto Manuel Brandão Simões, <ambs@cpan.org>

COPYRIGHT AND LICENSE

package Search::Glimpse::Index;

use warnings;
use strict;

use Search::Glimpse::ConfigData;
use File::Path qw(make_path);

our $VERSION = '0.01';

sub new {
    my $class = shift;
    my %ops = @_;

    # destdir is optional: only create and check it when one was given.
    if (defined $ops{destdir}) {
        make_path $ops{destdir}                           unless -d $ops{destdir};
        die "Can't use $ops{destdir}. Permission denied?" unless -d $ops{destdir};
    }

    my $self = bless
      {
       timeindex   => $ops{timeindex}   ? "-t" : "",
       dryrun      => $ops{dryrun}      ? "-I" : "",
       indexall    => $ops{indexall}    ? "-E" : "",
       indexnum    => $ops{indexnum}    ? "-n" : "",
       incremental => $ops{incremental} ? "-f" : "",
       structural  => $ops{structural}  ? "-s" : "",
       destdir     => $ops{destdir}     ? "-H $ops{destdir}" : "",
       stopword    => $ops{swsize}      ? "-S $ops{swsize}" : "",
       indexsize   => $ops{smallindex}  ? ""   : ($ops{bigindex} ? "-b" : "-o"),

       bin         => Search::Glimpse::ConfigData->config('glimpseindex')
      } => $class;

    return $self;
}

# check how to support .glimpse files

# -z     Allow customizable filtering, using the file .glimpse_filters to
#        perform the programs listed there  for  each  match.   The  best
#        example is compress/decompress.  If .glimpse_filters include the
#        line
#        *.Z   uncompress <
#        (separated by tabs) then before indexing any file  that  matches
#        the  pattern "*.Z" (same syntax as the one for .glimpse_exclude)
#        the command listed is executed first  (assuming  input  is  from
#        stdin, which is why uncompress needs <) and its output (assuming
#        it goes to stdout) is indexed.  The file itself is  not  changed
#        (i.e.,  it  stays  compressed).  Then if glimpse -z is used, the
#        same program is used on these files on the fly.  Any program can
#        be  used (we run 'exec').  For example, one can filter out parts
#        of files that should not  be  indexed.   Glimpseindex  tries  to
#        apply  all  filters  in  .glimpse_filters  in the order they are
#        given.  For example, if you want to uncompress a file  and  then
#        extract  some part of it, put the compression command (the exam-
#        ple above) first  and  then  another  line  that  specifies  the
#        extraction.  Note that this can slow down the search because the
#        filters need to be run before files are searched.

# -B     uses  a  hash table that is 4 times bigger (256k entries instead
#        of 64K) to speed up indexing.  The memory  usage  will  increase
#        typically  by  about  2  MB.   This  option is only for indexing
#        speed; it does not affect the final index.

# -i     Make .glimpse_include (SEE GLIMPSEINDEX FILES)  take  precedence
#        over  .glimpse_exclude,  so  that,  for example, one can exclude
#        everything (by putting *) and then explicitly include files.

# -M x   Tells  glimpseindex  to use x MB of memory for temporary tables.
#        The more memory you allow the faster glimpseindex will run.  The
#        default  is  x=2.   The  value  of x must be a positive integer.
#        Glimpseindex will need more memory than x for other things,  and
#        glimpseindex may perform some 'forks', so you'll have to experi-
#        ment if you want to use this option.  WARNING: If x is too large
#        you may run out of swap space.


# sub index_files { }
#
# -F     Glimpseindex  receives  the list of files to index from standard
#        input.


sub index {
    my $self = shift;
    my $path = shift;

    my $commandline = join(" ",
                           $self->{bin},
                           $self->{timeindex},
                           $self->{dryrun},
                           $self->{indexall},
                           $self->{indexnum},
                           $self->{indexsize},
                           $self->{stopword},
                           $self->{destdir},
                           $self->{structural},
                           $self->{incremental},
                           $path);

    # Run glimpseindex under the C locale (restored when this sub returns).
    local $ENV{LC_ALL} = 'C';
    my $output;
    open my $pipe, "-|", $commandline
      or die "Can't execute glimpseindex: $!";
    $output = join("" => <$pipe>);
    close $pipe;

    $self->{output} = $output || "";
    return $self;
}

#sub append { #... -a }

#sub delete { -d && -D (force) }

### PROBABLY NOTS

# -R     Recompute .glimpse_filenames_index from .glimpse_filenames.  The
#        file .glimpse_filenames_index speeds up processing.   Glimpsein-
#        dex  usually  computes  it  automatically.  However, if for some
#        reason one wants to change the path names of the files listed in
#        .glimpse_filenames,  then  running  glimpseindex  -R  recomputes
#        .glimpse_filenames_index.  This is useful if the index  is  com-
#        puted  on  one  machine,  but  is used on another (with the same
#        hierarchy).  The names of the files listed in .glimpse_filenames
#        are used in runtime, so changing them can be done at any time in
#        any way (as long as just the names not the content is  changed).
#        This  is  not really an option in the regular sense;  rather, it
#        is a program by itself, and it is  meant  as  a  post-processing
#        step.  (Available only from version 3.6.)

# -w k   Glimpseindex does a reasonable, but not a perfect, job of deter-
#        mining which files should not be  indexed.   Sometimes  a  large
#        text  file  should not be indexed; for example, a dictionary may
#        match most queries.  The -w  option  stores  in  a  file  called
#        .glimpse_messages  (in the same directory as the index) the list
#        of all files that contribute at least k new words to the  index.
#        The  user can look at this list of files and decide which should
#        or should not be indexed.  The  file  .glimpse_exclude  contains
#        files  that  will not be indexed (see more below).  We recommend
#        to set k to about 1000.  This is  not  an  exact  measure.   For
#        example,  if  the  same file appears twice, then the second copy
#        will not contribute any new words to the dictionary (but if  you
#        exclude  the  first  copy  and index again, the second copy will
#        contribute).

# -X     (starting at version 4.0B1) Extract titles from HTML  pages  and
#        add the titles to the index (in .glimpse_filenames).  (This fea-
#        ture was added to improve the performance of WebGlimpse.)  Works
#        only  on  files  whose  names  end with .html, .htm, .shtml, and
#        .shtm.  (see glimpse.h/EXTRACT_INFO_SUFFIX to add to these  suf-
#        fixes.)   The  routine to extract titles is called extract_info,
#        in index/filetype.c.  This feature can be  modified  in  various
#        ways  to  extract  info  from  many  filetypes.   The titles are
#        appended to the corresponding filenames with a space  separator.
#        Glimpseindex assumes that filenames don't have spaces in them.



!0;