| Search-Circa documentation | view source | Contained in the Search-Circa distribution. |
Search::Circa::Parser - functions used by Circa to parse HTML pages
use Search::Circa::Indexer;
my $index = Search::Circa::Indexer->new;
$index->connect(...);
# $url is the page to index, $account the id of the account that owns it
$index->Parser->look_at({ url => $url,
                          idr => $account });
This module uses HTML::Parser facilities. It is called by Search::Circa::Indexer
to index each document. The main method is look_at.
Creates a new Circa::Parser object with the properties of the indexer instance.
Indexes a URL. The work done is:
Keys for refHashParameters:
URL to read
Id of the URL in the links table
Id of the account the URL belongs to
(optional) If this parameter is set, Circa does no work on this page if the page is older than the given date.
(optional) Local URL used to reach the file
(optional) If $categorieAuto is set to true, Circa creates and sets the category of the URL from the directory structure it finds. Ex: http://www.alianwebserver.com/societe/stvalentin/index.html will create and set the category for this URL to Societe / StValentin. If $categorieAuto is set to false, $categorie is used instead.
(optional) Depth of the current link.
(optional) See $categorieAuto.
Returns (-1,0) if the URL is not valid; otherwise returns the number of words and the number of links found.
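The $categorieAuto behaviour described above can be illustrated with a minimal sketch. The helper name category_from_url is hypothetical and is not part of Search::Circa; it assumes the category is simply the list of directory names in the URL path, each with its first letter capitalised. Note that this naive rule produces "Stvalentin", not the "StValentin" shown in the documented example, so the module's actual capitalisation rule is evidently different and is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa: derive a category
# label from the directory part of a URL.
sub category_from_url {
    my ($url) = @_;

    # Drop the scheme and host, keeping only the path.
    (my $path = $url) =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;

    # Split the path into its non-empty segments.
    my @parts = grep { length } split m{/}, $path;

    # Assumption: when the URL does not end with "/", the last
    # segment names a file, so it is not part of the category.
    pop @parts if @parts && $url !~ m{/$};

    # Capitalise the first letter of each directory name.
    return join ' / ', map { ucfirst(lc($_)) } @parts;
}

print category_from_url(
    'http://www.alianwebserver.com/societe/stvalentin/index.html'
), "\n";
```

With $categorieAuto false, no such derivation is needed and the value passed in $categorie is used directly.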
Sets the user agent for the Circa robot. If local is set to 0 or $self->{ConfigMoteur}->{'temporate'}==0, LWP::UserAgent is used; otherwise LWP::RobotUA is used.
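The selection rule above can be written out as a small sketch. The function agent_class is hypothetical; its two arguments stand in for the local setting and for $self->{ConfigMoteur}->{'temporate'}, and it only returns the name of the class to load rather than building the agent.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa: returns the name of
# the user-agent class to use, following the rule stated in the text.
sub agent_class {
    my ($local, $temporate) = @_;

    # Either setting being 0 selects the plain user agent.
    return 'LWP::UserAgent' if $local == 0 || $temporate == 0;

    # Otherwise use the robot-aware agent.
    return 'LWP::RobotUA';
}

print agent_class(0, 1), "\n";
print agent_class(1, 1), "\n";
```

The practical difference is that LWP::RobotUA honours robots.txt and waits between requests to the same server, and its constructor requires an agent name and a contact e-mail address, whereas LWP::UserAgent fetches without those constraints.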
Splits data into words and stores them, with their scores, in the global %$RM. The hash structure is ('mots'=>facteur), that is, word => score factor.
Buffer to analyse
Basic score for each word
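As a rough illustration of the word => score structure, the sketch below splits a buffer into words and adds a base score to each one. The names %scores and score_words are hypothetical stand-ins for the module's %$RM and its splitting method; the real tokenisation rules (stop words, accented characters, minimum length) are not documented here and are not reproduced.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's global word table:
# each key is a word, each value its accumulated score.
my %scores;

# Split a text buffer into lowercase words and add $factor to the
# score of each word. Returns the number of distinct words so far.
sub score_words {
    my ($buffer, $factor) = @_;
    for my $word (split /\W+/, lc $buffer) {
        next unless length $word;    # skip the empty leading field
        $scores{$word} += $factor;
    }
    return scalar keys %scores;
}

score_words('Circa indexes pages', 10);    # for example, words from a title
score_words('pages of text', 1);           # for example, words from the body

printf "%s => %d\n", $_, $scores{$_} for sort keys %scores;
```

A word that appears in several buffers accumulates the score of each occurrence, so here "pages" ends up with 11.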
Method called for each HTML tag found in HTML pages.
Method called for the text content of each tag in HTML pages.
Checks whether the URL $links will be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.
If $links is accepted, returns the URL; otherwise returns 0.
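The acceptance rule can be sketched as follows. The function accept_link is hypothetical, and $host_prefix stands in for $self->host_indexed. Stripping the query string and fragment before reading the extension is an assumption of this sketch, not documented behaviour. The extension list is matched case-sensitively, exactly as listed above.

```perl
use strict;
use warnings;

# Extensions the documentation lists as rejected (case-sensitive).
my %rejected = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# Hypothetical helper, not part of Search::Circa.
# Returns the URL when it is accepted, 0 otherwise.
sub accept_link {
    my ($link, $host_prefix) = @_;

    # The URL must begin with the indexed host prefix.
    return 0 unless index($link, $host_prefix) == 0;

    # Assumption: ignore any query string or fragment.
    (my $path = $link) =~ s/[?#].*//s;

    # Take the extension of the last path segment, if it has one.
    my ($ext) = $path =~ m{://[^/]+/.*\.([^./]+)$};
    return 0 if defined $ext && $rejected{$ext};

    return $link;
}

my $prefix = 'http://www.example.org/';
for my $url ('http://www.example.org/docs/index.html',
             'http://www.example.org/docs/guide.pdf',
             'http://other.example.com/index.html') {
    printf "%s => %s\n", $url, accept_link($url, $prefix) ? 'accepted' : 'rejected';
}
```

Because the function returns the URL itself on success and 0 on failure, the result can be tested directly in boolean context, as the loop does.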
$Revision: 1.27 $
Search::Circa::Indexer
Alain BARBET alian@alianwebserver.com