Perlfect Solutions
 

[Perlfect-search] advanced exclude?!

will will@spanner.org
Fri, 8 Jun 2001 19:32:55 +0100
>On Thursday 07 June 2001 09:56, you wrote:
>
>>  Is there a way to do the indexing once, including this dir in the database
>>  to be searched, and then just reindex the rest of the site every night?
>
>No. But there's a certain chance that version 3.21 will have faster
>indexing for large collections.

That's because you operate on the whole document collection to 
calculate the weights?

It occurs to me that if you didn't unlink $TF_DB_FILE at the end of 
the run, you could keep it as a state file. Then, when someone 
requests an incremental index - a new directory, or the reindexing 
of a frequently updated part of their site - you could start by 
preloading %tf_db with the values from the last time the indexer 
ran. The difference would be invisible to crawl_whatever(), I think: 
it would just append the data from the new files, and then you could 
recalculate the weightings for everything.
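
Roughly what I have in mind, as an untested sketch: %tf_db and 
$TF_DB_FILE are the only names taken from the real indexer, and I'm 
guessing that the state file is (or could be) a DB_File hash holding 
the packed (doc_id, incidence) strings.

  # rough sketch only: preload the working hash from last run's file
  use DB_File;
  use Fcntl;

  sub preload_tf_state {
      my ($state_file, $tf_db) = @_;
      return unless -e $state_file;    # first run, nothing to preload
      tie my %old, 'DB_File', $state_file, O_RDONLY, 0644, $DB_HASH
          or die "can't open tf state file $state_file: $!";
      # copy the packed (doc_id, incidence) strings into the working hash
      $tf_db->{$_} = $old{$_} for keys %old;
      untie %old;
  }

Called as preload_tf_state($TF_DB_FILE, \%tf_db) before the crawl 
starts, it should leave everything downstream unchanged.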

I'm trying to think of disadvantages. Writing the tf data file would 
be a pain if the low memory flag wasn't set, and I guess it means 
indexer.pl would have to take parameters on the command line. You'd 
probably also need a threshold beyond which a full reindex was 
strongly suggested, like fsck does. And I guess the file could get 
rather large?
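
Something along these lines for the command line and the fsck-style 
nag; the option names and the 25% threshold are invented, and the 
document counters would have to come from wherever the indexer 
actually keeps them.

  # hypothetical options: --incremental plus one or more --dir=...
  use Getopt::Long;

  my $incremental = 0;
  my @only_dirs;
  GetOptions('incremental' => \$incremental, 'dir=s' => \@only_dirs);

  # if an incremental run touches more than a quarter of the documents
  # already in the index, suggest a full reindex instead
  my ($new_docs, $known_docs) = (0, 0);   # placeholders for real counters
  if ($incremental and $known_docs and $new_docs / $known_docs > 0.25) {
      warn "more than 25% of the collection changed; consider a full reindex\n";
  }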

But I think it would work transparently for existing files: there 
would be two (doc_id => incidence) pairs for that file in %tf_db, 
but the older one would be overwritten before weighting, at the 
point where you unpack $tf_db{$term_id}. That doesn't help for the 
tied file, of course.
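
What I mean by the overwrite, in sketch form (the 'w*' pack format 
is a guess, not necessarily what Perlfect actually uses):

  # unpack each term's packed pairs into a hash keyed by doc_id, so a
  # later pair for the same document silently replaces the older one
  for my $term_id (keys %tf_db) {
      my %incidence;
      my @pairs = unpack 'w*', $tf_db{$term_id};
      while (@pairs) {
          my ($doc_id, $count) = splice @pairs, 0, 2;
          $incidence{$doc_id} = $count;
      }
      # weight from %incidence, or pack it back if the deduplicated
      # form needs to go to disk as well
  }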

I'm just avoiding work here. I hope you don't mind the impertinent suggestions.

will


PS: Is this the right place for this sort of note?
-- 
pgpkey: http://www.spanner.org/keys/will.txt