| Search-Circa documentation | Contained in the Search-Circa distribution. |
Search::Circa::Parser - provide functions to parse HTML pages by Circa
use Search::Circa::Indexer;
my $index = new Search::Circa::Indexer;
$index->connect(...);
$index->Parser->look_at({ url => url,
idr => account });
This module use HTML::Parser facilities. It's call by Search::Circa::Indexer
for index each document. Main method is look_at.
Create a new Circa::Parser object with indexer instance properties
Index an url. Job done is:
Keys for refHashParameters:
Url to read
Id of url in table links
Id of account's url
(optional) : If this parameter is set, Circa didn't make any job on this page if it's older that the date.
(optional) Local url to reach the file
(optional) If $categorieAuto set to true, Circa will create/set the category of url with syntax of directory found. Ex: http://www.alianwebserver.com/societe/stvalentin/index.html will create and set the category for this url to Societe / StValentin. If $categorieAuto set to false, $categorie will be used.
(optional) Depth of actual link.
(optional) See $categorieAuto.
Return (-1,0) if url isn't valide, number of word and number of links found else
Set user agent for Circa robot. If local is set to 0 or $self->{ConfigMoteur}->{'temporate'}==0, LWP::UserAgent will be used. Else LWP::RobotUA is used.
Split data in words, and put them in global %$RM with score. Hash structure is ('mots'=>facteur).
Buffer to analyse
Basic score for each word
Method call for each HTML tag find in HTML pages.
Method call for each content of tag in HTML pages
Check if url $links will be add to Circa. Url must begin with $self->host_indexed, and his extension must be not doc,zip,ps,gif,jpg,gz, pdf,eps,png,deb,xls,ppt,class,GIF,css,js,wav,mid.
If $links is accepted, return url. Else return 0.
$Revision: 1.27 $
Search::Circa::Indexer
Alain BARBET alian@alianwebserver.com
| Search-Circa documentation | Contained in the Search-Circa distribution. |
package Search::Circa::Parser; # module Circa::Parser : See Circa::Indexer # Copyright 2000 A.Barbet alian@alianwebserver.com. All rights reserved. # $Log: Parser.pm,v $ # Revision 1.27 2003/01/02 00:32:40 alian # Add url found with meta http-equiv refresh # # Revision 1.26 2002/12/31 09:58:52 alian # Use hash in place of list in look_at # Call analyse in each text call in place of global var TEXT # Update POD doc # # Revision 1.25 2002/12/29 14:35:10 alian # Some minor fixe suite to last update # # Revision 1.24 2002/12/29 03:18:37 alian # Update POD documentation # # Revision 1.23 2002/12/29 00:36:30 alian # Add undef %insite => dangerous global var ... # # Revision 1.22 2002/12/28 22:23:59 alian # Some optimization after bench # # Revision 1.21 2002/12/28 12:36:02 alian # Ajout phase pour ne pas analyser les mots d'un sommaire # # Revision 1.20 2002/12/27 12:55:43 alian # Use ref in analyse, update stopwords use strict; use URI::URL; use URI::WithBase; use DBI; use LWP::RobotUA; use Carp qw/cluck/; use vars qw(@ISA @EXPORT @EXPORT_OK %EXPORT_TAGS $VERSION %links %inside $RM $DESCRIPTION $KEYWORDS $facteur_full_text); require Exporter; @ISA = qw(Exporter); @EXPORT = qw(); $VERSION = ('$Revision: 1.27 $ ' =~ /(\d+\.\d+)/)[0]; # stopwords my %bad = map {$_ => 1} qw ( able about above according across actually after afterwards again against ago all almost already also althought altogether always among amongst and another any anyhow anyone anything anyway anywhere apart are aren around aside away back because been before beforehand behind being below beneath beside besides between beyond but came can cannot come could couldn currently did didn directly does doing don done down downward during each easily else elsewhere enough especially even ever every everybody everyone everything everywhere exactly except far farther few fewer find five for formerly forth found four frequently from full fully further generally get gets give given going gone gonna got gotten had hardly has have having height hence her here hereafter hereby herein hereupon hers herself him himself his hope how however immediatly including indeed inside instead into inward isn its itself just know largely last lately later latest least leave less lesser let lets like liked likes likewise little lot lower made mainly make making many may maybe means meantime meanwhile might mine more moreover most mostly mrs much must myself namely near necessarily neither never nevertheless nine nobody none nonetheless nor not nothing now nowhere often once one only onto other others otherwise ought our ours ourself ourselves out outside over overall own per perform perharps please previous previously prior probably provide providing quickly quite rather read ready really recently require roughly said same say see sent seven several shall shan she should shouldn simply since six slightly some somebody somehow someone something sometime sometimes somewhat somewhere soon still strictly such take ten than thanks that the their theirs them themselves then thence there thereafter thereby therefore therein thereupon these they think this those though three through thru thus thusly timely together too took top toward towards truly two unable under unless unlike unlikely until upon upward upwards use used using usually various very wanna want was wasn well went were weren what whatever when whence whenever where whereabouts whereafter whereas whereby wherefor wherein whereis whereupon wherever whether which whichever while whither who whoever whole whom whomever whose why will with within without worth worthy would wouldn yes yet you your yours yourself yourselves afin ailleurs ainsi ais ait alors aucun aucune aucunes aucuns auparavant auquel assez aussi autour autre autres aux auxquelles avait avant avec avoir beaucoup bien car ceci cela celle celui cependant certain certaine certaines certains ces cet cette ceux chacun chacune chacunes chaque chez cinq combien comme comment contre dans dedans depuis des desquelles desquels deux dire dit dix doit donc dont duquel elle elles encore enfin entre environ est etc eux faire fait faut fit fut huit ici ils jamais laquelle lequel lequels les lesquelles lesquels leur leurs lors lorsque lui maintenant mais mes moi moins mon neuf non nos notre nous ont oui par parce parfois pas peu peut plus plusieurs pour pourquoi pourtant puis quand quant quatre que quel quelconque quelle quelles quelque quelquefois quelques quels qui quoi quoique sans sept ses sinon six soit son sont soudain sous suis sur tandis tant tel telle tels tes toi ton toujours tous tout toute toutes toutefois toutes trois une vers veut voici voir vos votre vous \$ { & com/ + www html htm file/); #------------------------------------------------------------------------------ # new #------------------------------------------------------------------------------ sub new { my $class = shift; my $self = {}; my $indexer = shift; $indexer->trace(5, "Search::Circa::Parser::new\n"); bless $self, $class; $self->{DBH} = $indexer->{DBH}; $self->{ConfigMoteur} = $indexer->{ConfigMoteur}; while (my ($n,$v)=each(%{$indexer->{ConfigMoteur}})) { $indexer->trace(4, "\t$n => $v"); } $self->{INDEXER} = $indexer; $facteur_full_text = $self->{ConfigMoteur}->{'facteur_full_text'}; # Ce module n'est presque jamais installé ! # Evidemment cela demande une charge machine et un .so # compilé pour cet environnement. Ca fait peur aux admin # ISP ! On encapsule donc l'appel, si on echoue, on previent que # tout appel au parser se soldera par une utilisation d'un parseur basic # sans handicaper le reste de l'application # Il vous reste plus qu'a faire alors une install mysql/circa en local # pour faire l'indexation, et exporter les resultats sur le serveur final. $self->{_parser_ok}=1; eval { require HTML::Parser }; if ($@ || $HTML::Parser::VERSION < 3.0) { warn "Module HTML-Parser 3.0 ou superieur requis pour ". "utiliser les fonctionnalités optimales du parser.($@)\n"; $self->{_parser_ok}=0; } else { use HTML::Entities; } $self->{INDEXER}->trace(1,"Parser::new"); return $self; } #------------------------------------------------------------------------------ # tag #------------------------------------------------------------------------------ sub tag { my($tag, $num, $att) = @_; # parametre # Liens exterieurs if ((lc($tag) eq 'a') and ($$att{href})) {$links{$$att{href}}=1;} # Frame elsif ((lc($tag) eq 'frame') and ($$att{src})) {$links{$$att{src}}=1;} # On est dans le cas d'un meta elsif (lc($tag) eq 'meta' and defined(%$att)) { if ($$att{name} and lc($$att{name}) eq 'description') {# Description $DESCRIPTION =$$att{content};} elsif ((lc($$att{'http-equiv'}) eq 'keywords') or (lc($$att{name}) eq 'keywords')) {# Mots-clefs $KEYWORDS=$$att{content} ;} elsif ((lc($$att{'http-equiv'}) eq 'refresh') and ($$att{content}=~/\d*;URL=(.*)$/)) {#url refresh $links{$1}=1; } } # Area elsif (($tag eq 'area') and ($$att{href})) {$links{$$att{href}}=1;} $inside{$tag} += $num; # Profondeur de la balise } #------------------------------------------------------------------------------ # text #------------------------------------------------------------------------------ sub text { return if $inside{script} || $inside{style}; analyse($_[0], $facteur_full_text); } #------------------------------------------------------------------------------ # look_at #------------------------------------------------------------------------------ sub look_at { my($this, $rh)=@_; # $url,$idc,$idr,$lastModif,$url_local,$categorieAuto,$niveau,$categorie undef %links; $RM={}; undef %inside; $rh->{niveau} = 0 if (!$rh->{niveau}); $rh->{categorie} = 0 if (!$rh->{categorie}); my $buf_debug = "\tUrl => $rh->{url}\n\tIdc => $rh->{idc}\n"; $buf_debug.= "\tLast update => $rh->{lastModif}" unless (!defined($rh->{lastModif})); $buf_debug.= "\tUrl local => $rh->{url_local}" unless (!defined($rh->{url_local})); $this->{INDEXER}->trace(3, "Parser::look_at\n$buf_debug"); my ($url_orig,$racineFile,$racineUrl,$lastUpdate); if ($rh->{url_local} or URI->new($rh->{url})->scheme eq 'file') { $this->set_agent(1); } else { $this->set_agent(0); } if ($rh->{url_local}) { $this->{ConfigMoteur}->{'temporate'}=0; if ($rh->{url_local}=~/.*\/$/) { chop($rh->{url_local}); if (-e "$rh->{url_local}/index.html") { $rh->{url_local}.="/index.html";} elsif (-e "$rh->{url_local}/index.htm") { $rh->{url_local}.="/index.htm";} elsif (-e "$rh->{url_local}/default.htm") { $rh->{url_local}.="/default.htm";} else {return (-1,0,0);} } $url_orig=$rh->{url}; $rh->{url}=$rh->{url_local}; ($racineFile,$racineUrl) = $this->{INDEXER}->fetch_first("select path,url from ". $this->{INDEXER}->pre_tbl."local_url ". "where id=$rh->{idr}"); } my ($nb,$nbwg,$nburl)=(0,0,0); if ($rh->{url_local}) {$this->{INDEXER}->set_host_indexed($rh->{url_local});} else {$this->{INDEXER}->set_host_indexed($rh->{url});} # Creation d'une requete # On passe la requete à l'agent et on attend le résultat my $res = $this->{AGENT}->request(new HTTP::Request('GET' => $rh->{url})); $this->{INDEXER}->trace(2, "HTTP::Request return ".$res->status_line); if ($res->is_success) { # Langue my $language = $res->content_language || 'unkno'; if ($rh->{lastModif}) { $this->{INDEXER}->trace(2,"Update url ".$rh->{lastModif}.' '. $res->last_modified); } # Fichier non modifie depuis la derniere indexation if (($rh->{lastModif}) && ($res->last_modified) && ($rh->{lastModif} >= $res->last_modified)) { $this->{INDEXER}->trace(1,"No update on $rh->{url}"); $this->{INDEXER}->URL->update ($rh->{idr},('id'=>$rh->{idc}, 'last_check'=>"NOW()")); return (0,0,0); } if ($res->last_modified) { my @date = localtime($res->last_modified); $lastUpdate = ($date[5]+1900).'-'.($date[4]+1).'-'. $date[3].' '.$date[2].':'.$date[1].':'.$date[0]; } else {$lastUpdate='0000-00-00';} my $x = 72-length($rh->{url}); if ( $this->{inindex}) { print $this->{inindex},'/',$this->{toindex}," ", $rh->{url},($ENV{SERVER_NAME} ? "<br>\n" : (" "x$x)."\n"); } # Il serait judicieux de mettre ca dans le constructeur, # mais cela entraine 10 Mo de Ram supplementaire à # l'utilisation. A voir avec les evolution du module # HTML::Parser if ($this->{_parser_ok}) { $this->{INDEXER}->trace(3,"Use HTML::Parser ..."); my $parser = HTML::Parser->new (api_version => 3, handlers => [start => [\&tag, "tagname, '+1', attr"], end => [\&tag, "tagname, '-1', attr"], text => [\&text, "dtext"], ], marked_sections => 1); # parse du fichier $parser->parse($res->content) || print STDERR "Can't parse ".$res->content."::$!\n"; } else { $this->{INDEXER}->trace(1,"Use a basic parser ..."); my $TEXT = $res->content; $TEXT=~s{ <! (.*?) (--.*?--\s*)+(.*?)> } { if ($1 || $3) {"<!$1 $3>";} }gesx; $TEXT=~s{ <(?: [^>\'\"] * | ".*?" | '.*?' ) + > }{}gsx; analyse(decode_entities($TEXT), $this->{ConfigMoteur}->{'facteur_full_text'}); } # Mots clefs et description my ($desc,$keyword)=($DESCRIPTION||' ',$KEYWORDS||' '); undef $DESCRIPTION; undef $KEYWORDS; my $titre = $res->title || $rh->{url};# Titre # Categorie if ($rh->{categorieAuto}) { $rh->{categorie} = $this->{INDEXER}->categorie->get($rh->{url}, $rh->{idr}); } if (!$rh->{categorie}) {$rh->{categorie}=0;} # Mis a jour de l'url if ($this->{INDEXER}->URL->update ($rh->{idr}, (parse => 1, id => $rh->{idc}, titre => $titre, description => $desc, last_update => $lastUpdate, last_check => 'NOW()', langue => $language, categorie => $rh->{categorie} ) )) { $this->{INDEXER}->trace(2, "$rh->{url} mis à jour avec success"); } # Traitement des mots trouves analyse($keyword,$this->{ConfigMoteur}->{'facteur_keyword'}); analyse($desc,$this->{ConfigMoteur}->{'facteur_description'}); analyse($titre,$this->{ConfigMoteur}->{'facteur_titre'}); analyse($rh->{url},$this->{ConfigMoteur}->{'facteur_url'}); $this->{INDEXER}->dbh->do ("delete from ".$this->{INDEXER}->pre_tbl.$rh->{idr}."relation ". "where id_site = $rh->{idc}"); # Chaque mot trouve plus de $ConfigMoteur{'nb_min_mots'} fois # est enregistre # On passe cette etape si le nombre de liens de la page est superieur # a 50% le nombre de mots retenus, il s'agit alors # d'un sommaire peut interessant à consulter my $nbw = 0; if (scalar keys %links < (( scalar keys %$RM) * 0.5)) { while (my ($mot,$nb)=each(%$RM)) { next if (!$nb or $nb < $this->{'ConfigMoteur'}->{'nb_min_mots'}); my $requete = "insert into ". $this->{INDEXER}->pre_tbl.$rh->{idr}. "relation (mot,id_site,facteur) ". "values ('$mot',$rh->{idc},$nb)"; $this->{INDEXER}->dbh->do($requete) && $nbwg++; $this->{INDEXER}->trace(4,"\t\tStore words: ".$requete); } $nbw=keys %$RM; } else { $this->{INDEXER}->trace (1,"Sommaire - ".(scalar keys %$RM). " mots ignores pour ".(scalar keys %links)." liens"); } # On n'indexe pas les liens si on est au niveau max if ($rh->{niveau} == $this->{ConfigMoteur}{'niveau_max'}) { $this->{INDEXER}->trace(1,"Niveau max atteint. Liens suivants de ". "cette page ignorés<br>"); return (0,0,0); } # Traitement des url trouves my $base = $res->base; my @l = keys %links; undef %links; $this->{INDEXER}->trace(2, "Liens trouvés") if ($#l>0); foreach my $var (@l) { $var = url($var,$base)->abs; # Url absolu $var = $this->check_links('a',$var); if (($rh->{url_local}) && ($var)) { my $urlb = $var; $urlb=~s/$racineFile/$racineUrl/g; #print h1("Ajout site local:$$var[2] pour $racineFile"); $this->{INDEXER}->trace(2, "\t".$urlb); if ($this->{INDEXER}->URL->add ($rh->{idr}, (url => $urlb, urllocal => $var, niveau => $rh->{niveau}+1, categorie => $rh->{categorie}, valide => 1, browse_categorie=>$rh->{categorieAuto}))) { $nburl++; } else {$this->{INDEXER}->trace (2,"\tCan't add $urlb:\n\t$DBI::errstr");} } elsif ($var) { $this->{INDEXER}->trace(2, "\t".$var); if ($this->{INDEXER}->URL->add ($rh->{idr}, (url => $var, niveau => $rh->{niveau}+1, categorie => $rh->{categorie}, valide => 1))) { $nburl++; } else { $this->{INDEXER}->trace (2,"\tCan't add $var:\n\t$DBI::errstr");} } } $this->{INDEXER}->trace(3, "---------------------------------\n"); return ($nburl,$nbw,$nbwg); } # Sinon previent que URL defectueuse else { print "*** ", $res->code," : $rh->{url}\n";return (-1,0,0);} } #------------------------------------------------------------------------------ # set_agent #------------------------------------------------------------------------------ sub set_agent { my ($self,$locale)=@_; $self->{INDEXER}->trace(5, "Circa::Parser::set_agent $locale\n"); return if ($self->{AGENT} && $self->{_ROBOT}==$locale); # agent already set $self->{_ROBOT}=$locale; if (($self->{ConfigMoteur}->{'temporate'}) && (!$locale)) { $self->{'AGENT'} = LWP::RobotUA->new ("CircaParser $VERSION",$self->{'ConfigMoteur'}->{'author'}); $self->{AGENT}->delay(1/120.0); } else {$self->{AGENT} = new LWP::UserAgent; } if ($self->{PROXY}) {$self->{AGENT}->proxy(['http', 'ftp'], $self->{PROXY});} $self->{AGENT}->max_size($self->{INDEXER}->size_max) if ($self->{INDEXER}->size_max); $self->{AGENT}->timeout(25); # Set timeout to 25s (defaut 180) } #------------------------------------------------------------------------------ # analyse #------------------------------------------------------------------------------ sub analyse { my $data = shift; my $facteur = shift; my $e; return if (!$data or !$facteur); # Ponctuation et mots recurents $data=~s/http:\/\// /gm; $data=~tr/\n\t<>.;:,?!()\"\'[]#=\/_/ /; $data=~s/\s+/ /gm; foreach (split(/\s/,$data)) { next if !$_; $e=lc($_); $$RM{$e}+=$facteur if (($e =~/\w/)&&(length($e)>2)&&(!$bad{$e})&&($e !~/^\d*$/)); } } #------------------------------------------------------------------------------ # check_links #------------------------------------------------------------------------------ sub check_links { my($self,$tag,$links) = @_; my $host = $self->{INDEXER}->host_indexed; my $li = "doc|zip|ps|gif|jpg|gz|pdf|eps|png|deb|xls|ppt|". "class|GIF|css|js|wav|mid"; my $bad = qr/\.($li)$/i; if (($tag) && ($links) && ($tag eq 'a') && ($links=~/\Q$host\E/) && ($links !~ $bad)) { if ($links=~/^(.*?)#/) {$links=$1;} # Don't add anchor if ((!$self->{ConfigMoteur}->{'indexCgi'})&&($links=~/^(.*?)\?/)) {$links=$1;} return $links; } return 0; } #------------------------------------------------------------------------------ # POD DOCUMENTATION #------------------------------------------------------------------------------