MMM::Text::Search - Perl module for indexing and searching text files and web objects


Contained in the MMM-Text-Search distribution.

NAME

MMM::Text::Search - Perl module for indexing and searching text files and web objects

SYNOPSIS

  use MMM::Text::Search;

  # For indexing...
  my $srch = MMM::Text::Search->new({
	# main index file location
		IndexPath => "/tmp/myindex.db",
	# local files (optional)
		FileMask  => '(?i)(\.txt|\.htm.?)$',
		Dirs	  => [ "/usr/doc", "/tmp" ],
		FollowSymLinks => 0,	# 0 or 1 (default = 0)
	# web objects (optional)
		URLs	  => [ "http://localhost/", ... ],
		Level	  => $recursion_level,	# 0 = unlimited
	# common options
		IgnoreLimit => 0.3,	# (default = 2/3)
		Verbose => 0,		# 0 or 1
  });

  $srch->start_indexing_session();

  $srch->commit_indexing_session();

  $srch->index_default_locations();

  $srch->index_content( { title =>   '...', 
		    	  content=>  '...', 
		    	  id =>      '...'  } );

  $srch->makeindex;	# obsolete

  # For searching...
  my $srch = MMM::Text::Search->new( "/tmp/myindex.db", $verbose_flag );

  my $hashref = $srch->query( "pizza", "ciao", "-pasta" );
  $hashref = $srch->advanced_query( "(pizza OR ciao) AND NOT pasta" );

  $srch->errstr();	# returns last error
			# (only query syntax errors for the time being)

  $srch->dump_word_stats(\*FH);

DESCRIPTION

When a session is closed the following files will have been created (assuming IndexPath = /path/myindex.db, see constructor):



	/path/myindex.db	     word index database
	/path/myindex-locations.db   filename/URL database
	/path/myindex-titles.db	     html title database
	/path/myindex.stopwords	     stop-words list
	/path/myindex.filelist	     readable list of indexed files/URLs
	/path/myindex.deadlinks	     broken http links

	[... lots of important things missing ... ]
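The file list above suggests that the companion files share the IndexPath stem. A minimal sketch of that naming scheme follows, assuming the stem is obtained by stripping a trailing ".db"; the helper name session_files and its hash keys are hypothetical, and the module's behaviour for an IndexPath without a ".db" suffix is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: list the files a closed session is documented to
# leave behind, keyed by role. Assumes the stem is IndexPath minus ".db".
sub session_files {
    my ($index_path) = @_;
    (my $stem = $index_path) =~ s/\.db$//;
    return (
        words     => $index_path,            # word index database
        locations => "$stem-locations.db",   # filename/URL database
        titles    => "$stem-titles.db",      # HTML title database
        stopwords => "$stem.stopwords",      # stop-words list
        filelist  => "$stem.filelist",       # readable list of indexed files/URLs
        deadlinks => "$stem.deadlinks",      # broken http links
    );
}

# Example: report which of the expected files exist on disk.
my %files = session_files("/path/myindex.db");
for my $role (sort keys %files) {
    printf "%-10s %s%s\n", $role, $files{$role},
        (-e $files{$role} ? "" : " (missing)");
}
```

Such a helper can be handy for cleaning up or backing up an index, since all six files belong together.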

start_indexing_session() starts a new indexing session.

commit_indexing_session() commits and closes the current session.

index_default_locations() indexes all files and URLs specified at construction.

index_content() pushes content into the indexing engine. The argument must be a hash reference with the following structure:

 { title => '...', content => '...', id => '...' }

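As an illustration of how index_content() fits between the two session calls, here is a minimal sketch. The helper name index_records and the record layout passed to it are hypothetical; only start_indexing_session(), index_content() and commit_indexing_session() come from this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: index a list of records within a single session.
# $srch is expected to be an MMM::Text::Search object built for indexing.
sub index_records {
    my ($srch, @records) = @_;

    $srch->start_indexing_session();
    for my $rec (@records) {
        # Pass exactly the three documented keys.
        $srch->index_content({
            title   => $rec->{title},
            content => $rec->{content},
            id      => $rec->{id},
        });
    }
    $srch->commit_indexing_session();

    return scalar @records;    # number of records pushed
}
```

It would be called with an object constructed as in the SYNOPSIS, for example index_records($srch, { id => 'doc-1', title => 'Pizza', content => '...' }). Wrapping all records in one session keeps the start and commit calls paired.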
makeindex() is obsolete. It is equivalent to:

	$srch->start_indexing_session();
	$srch->index_default_locations();
	$srch->commit_indexing_session();

dump_word_stats(\*FH) dumps all words, sorted by occurrence frequency, to the file handle FH (or to STDOUT if no parameter is specified). Stop-words get a frequency value of 1.
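A minimal sketch of writing the word statistics to a file instead of STDOUT follows. The helper name save_word_stats is hypothetical; it passes a glob reference as in the \*FH form shown above, and makes no assumption about the format of the dumped lines.

```perl
use strict;
use warnings;

# Hypothetical helper: dump the word statistics of $srch into $path.
sub save_word_stats {
    my ($srch, $path) = @_;

    local *FH;    # keep the bareword handle private to this call
    open(FH, '>', $path) or die "cannot write $path: $!";
    $srch->dump_word_stats(\*FH);
    close(FH) or die "cannot close $path: $!";

    return $path;
}
```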

Both query() and advanced_query() return a reference to a hash with the following structure:

	(
	 ignored  => [ string, string, ... ],   # ignored words
	 searched => [ string, string, ... ],   # words searched for
	 entries  => [ hashref, hashref, ... ]  # list of records found
	)

The 'entries' element is a reference to an array of hashes, each having the following structure:

	(
 	 location => string,  # file path or URL or anything
	 score    => number,  # score 
	 title    => string   # HTML title		 
	)
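To show how the two structures nest, here is a minimal sketch that turns a result hash into printable lines. The helper name format_results and the sample data are hypothetical (the sample is hand-made, not real module output), and the entries are sorted explicitly because no ordering is documented.

```perl
use strict;
use warnings;

# Hypothetical helper: format a result hash of the documented shape.
sub format_results {
    my ($res) = @_;
    my @lines;

    push @lines, "searched: " . join(", ", @{ $res->{searched} || [] });
    push @lines, "ignored: "  . join(", ", @{ $res->{ignored}  || [] });

    # Sort by descending score; the order of 'entries' is not assumed.
    for my $e (sort { $b->{score} <=> $a->{score} } @{ $res->{entries} || [] }) {
        my $title = defined $e->{title} && length $e->{title}
                  ? $e->{title}
                  : '(untitled)';
        push @lines, sprintf("%.2f  %s  [%s]", $e->{score}, $e->{location}, $title);
    }
    return @lines;
}

# Hand-made sample in the documented shape (not real output).
my $sample = {
    ignored  => [ 'the' ],
    searched => [ 'pizza', 'ciao' ],
    entries  => [
        { location => '/usr/doc/a.txt',          score => 1.5, title => ''     },
        { location => 'http://localhost/b.html', score => 3,   title => 'Ciao' },
    ],
};
print "$_\n" for format_results($sample);
```

With a real index, the hash reference returned by query() or advanced_query() would be passed in place of $sample.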

NOTES

Note on implementation: The indexing technique is largely derived from the one described by Tim Kientzle in Dr. Dobb's Journal.

BUGS

Many, I guess.

AUTHOR

Max Muzi <maxim@comm2000.it>

SEE ALSO

perl(1).
