Combine::SD_SQL - Combine::SD_SQL documentation


Combine documentation  | view source Contained in the Combine distribution.

NAME

SD_SQL

DESCRIPTION

Reimplementation of sd.pl, SD.pm and SDQ.pm using MySQL. Contains both recyc and guard.

The basic idea is to have a table (urldb) that contains most URLs ever inserted into the system, together with a lock (the guard function) and a boolean harvest flag. This table also holds the host part of each URL together with its lock. URLs are selected from this table based on urllock, netloclock and harvest, and inserted into a queue (table que). URLs from this queue are then given out to harvesters.

The queue is implemented as follows. The admin table can be used to generate sequence numbers like this:

    mysql> update admin set queid=LAST_INSERT_ID(queid+1);

The sequence number is then used to extract the next URL from the queue:

    mysql> select host,url from que where queid=LAST_INSERT_ID();

When the queue is empty it is filled from table urldb. Several different algorithms can be used to fill it (round-robin, most URLs, longest time since harvest, ...). Since the harvest flag and guard lock are not updated until the actual harvest is done, it is OK to delete the queue and regenerate it at any time.
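The sequence-number queue described above can be illustrated with a small, self-contained sketch. It is not the module's code: it uses Python's built-in sqlite3 in place of MySQL, so the `LAST_INSERT_ID(queid+1)` idiom is approximated by incrementing `admin.queid` and reading it back inside one transaction. The column layout of `admin` and `que`, and the function names `make_db`, `fill_queue` and `next_url`, are assumptions made for illustration only.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical minimal admin and que tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE admin (queid INTEGER NOT NULL);
        INSERT INTO admin (queid) VALUES (0);
        CREATE TABLE que (queid INTEGER PRIMARY KEY, host TEXT, url TEXT);
    """)
    return db


def fill_queue(db, entries):
    """Regenerate the queue: drop old contents, reset the counter, number entries from 1."""
    with db:
        db.execute("DELETE FROM que")
        db.execute("UPDATE admin SET queid = 0")
        db.executemany(
            "INSERT INTO que (queid, host, url) VALUES (?, ?, ?)",
            [(i, host, url) for i, (host, url) in enumerate(entries, start=1)],
        )


def next_url(db):
    """Advance the sequence number and return the (host, url) row stored at it.

    Returns None when the queue is exhausted, which is the signal to refill it.
    """
    with db:  # single transaction: increment and read back together
        db.execute("UPDATE admin SET queid = queid + 1")
        (queid,) = db.execute("SELECT queid FROM admin").fetchone()
        return db.execute(
            "SELECT host, url FROM que WHERE queid = ?", (queid,)
        ).fetchone()
```

Because the queue only holds copies of rows selected from urldb, calling `fill_queue` again at any point simply discards the pending entries and starts numbering from 1, which mirrors the remark that the queue can be deleted and regenerated at any time.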

Questions, ideas, TODOs, etc.

Split table urldb into 2 tables, one for URLs and one for hosts? This would be less efficient when filling que, but more efficient when updating netloclock.

Data structure, TABLE hosts:

    create table hosts(
      host varchar(50) not null default '',
      netloclock int not null,
      retries int not null default 0,
      ant int not null default 0,
      primary key (host),
      key (ant),
      key (netloclock)
    );

Handle too many retries?

    # Algorithm: take a URL from the host that was accessed longest ago
    ($host,$url,$id)=SELECT urls.host,urls.url,urls.id FROM hosts,urls WHERE
        hosts.netloclock < UNIX_TIMESTAMP() AND
        hosts.host=urls.host AND
        urls.urllock < UNIX_TIMESTAMP() AND
        urls.harvest=1 ORDER BY hosts.netloclock LIMIT 1;

    # Algorithm: take a URL from the host with the most URLs
    ($host,$url,$id)=SELECT urls.host,urls.url,urls.id FROM hosts,urls WHERE
        hosts.netloclock < UNIX_TIMESTAMP() AND
        hosts.host=urls.host AND
        urls.urllock < UNIX_TIMESTAMP() AND
        urls.harvest=1 ORDER BY hosts.ant DESC LIMIT 1;

    # Algorithm: take a URL from any available host
    ($host,$url,$id)=SELECT urls.host,urls.url,urls.id FROM hosts,urls WHERE
        hosts.netloclock < UNIX_TIMESTAMP() AND
        hosts.host=urls.host AND
        urls.urllock < UNIX_TIMESTAMP() AND
        urls.harvest=1 LIMIT 1;
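The three selection algorithms differ only in their ordering clause, which the following sketch makes explicit. It is an illustration under stated assumptions, not the module's implementation: it runs on Python's built-in sqlite3 rather than MySQL, passes the current time in as a parameter instead of calling `UNIX_TIMESTAMP()`, uses the host lock column `netloclock` from the hosts table definition, and assumes a `urls` table with columns `id`, `host`, `url`, `urllock` and `harvest` (this table is never defined in these notes). The column `ant` is taken to be the number of URLs per host, as the "most URLs" ordering implies. The strategy names and `pick_url` are hypothetical.

```python
import sqlite3

# Hypothetical schema: hosts follows the definition in the notes, urls is assumed.
SCHEMA = """
CREATE TABLE hosts (host TEXT PRIMARY KEY,
                    netloclock INTEGER NOT NULL,
                    retries INTEGER NOT NULL DEFAULT 0,
                    ant INTEGER NOT NULL DEFAULT 0);
CREATE TABLE urls  (id INTEGER PRIMARY KEY,
                    host TEXT NOT NULL,
                    url TEXT NOT NULL,
                    urllock INTEGER NOT NULL,
                    harvest INTEGER NOT NULL);
"""

# One ORDER BY clause per strategy; the keys are illustrative labels only.
ORDERINGS = {
    "longest_ago": "ORDER BY hosts.netloclock",  # host accessed longest ago
    "most_urls": "ORDER BY hosts.ant DESC",      # host with the most URLs
    "any": "",                                   # any available host
}


def pick_url(db, strategy, now):
    """Return (host, url, id) for one harvestable URL, or None if none is available.

    A URL qualifies when both its host lock and its own lock have expired
    (are earlier than `now`) and its harvest flag is set.
    """
    query = f"""
        SELECT urls.host, urls.url, urls.id
        FROM hosts JOIN urls ON hosts.host = urls.host
        WHERE hosts.netloclock < :now
          AND urls.urllock < :now
          AND urls.harvest = 1
        {ORDERINGS[strategy]}
        LIMIT 1
    """
    return db.execute(query, {"now": now}).fetchone()
```

Keeping the filter fixed and swapping only the ordering makes it easy to try another fill policy without touching the lock and harvest conditions.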

AUTHOR

Anders Ardö <anders.ardo@it.lth.se>

COPYRIGHT AND LICENSE
