Perlfect Solutions

[Perlfect-search] update the index?

Sat, 12 Aug 2000 17:57:20 +0300
Daniel Naber wrote:
> On Sat, 12 Aug 2000, you wrote:
> > The only disadvantage of incremental indexing is that you have to keep
> > all hashes stored between indexing operations so it requires a bit more
> > space...
> This is interesting for non-web stuff, too. Maybe anybody has experience
> using Berkeley Hashes from C++? Maybe you can even access Hashes from C++
> that were written from Perl?
> Imagine a mail client that always has an up-to-date index of all your mails.
> You could have a full text search over thousands of mails in less than one
> second!
> So is adding documents at a later time less efficient compared to adding
> them when the index is created?

What you propose is very interesting indeed. At some point last year I
thought of extending Perlfect Search to non-web stuff, but there was (and
still is) so much to do on the web front that I abandoned the idea...

What I wanted to do was apply the idea of document clustering to
Perlfect Search. A document cluster is a set of documents that are
related to the same subject. Calculating these clusters given a set of
documents is not easy, but I happened to do my MSc thesis on clustering
so I got very much into this stuff. If we were to build clustering into
Perlfect Search we would have some very useful features:

1. Given a set of search results we would be able to cluster them and
therefore separate subjects. Imagine a query for "body parts". This
would presumably return documents both about car body parts and human
body parts. The clustering procedure would be able to automatically
separate these two contexts and therefore immediately give a choice to
the user to pick one of the two collections.

2. Clustering can also be applied to keywords. So in searching for body
parts the user could be given a list of other keywords to try. In our
case these could include human and car.

3. We could cluster the whole collection of documents (not only the
search results) and then let the user find "similar documents". So then
each search result could have a link next to it saying: "Show me similar
documents" and this would take the user to the cluster that the
particular document belongs to.
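
To make features 1 and 3 a bit more concrete, here is a rough sketch in
Python. It is not the clustering method from my thesis and not part of
Perlfect Search: it assumes plain bag-of-words vectors, cosine
similarity and a simple greedy "leader" clustering pass. The names
vectorize, cosine, cluster, similar_documents and SAMPLE_DOCS, and the
0.4 threshold, are made up for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical sample collection, echoing the "body parts" example.
SAMPLE_DOCS = {
    "car1": "car body parts engine door bumper",
    "car2": "replacement car door and bumper parts",
    "human1": "human body parts arm leg heart",
    "human2": "anatomy of the human heart and arm",
}


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.4):
    """Greedy single-pass clustering (an assumed, simplistic technique).

    Each document joins the first cluster whose leader (the first
    document placed in it) is at least `threshold` similar; otherwise
    it starts a new cluster.
    """
    clusters = []  # list of (leader_vector, [doc ids])
    for doc_id, text in docs.items():
        vec = vectorize(text)
        for leader, members in clusters:
            if cosine(leader, vec) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((vec, [doc_id]))
    return [members for _, members in clusters]


def similar_documents(doc_id, clusters):
    """Return the other members of the cluster containing doc_id."""
    for members in clusters:
        if doc_id in members:
            return [m for m in members if m != doc_id]
    return []
```

Calling cluster(SAMPLE_DOCS) separates the car documents from the human
ones, and similar_documents("car1", ...) is what a "Show me similar
documents" link would display. A real implementation would need term
weighting and a better algorithm than this single pass.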

Imagine how useful these features could be even for a mail client.

So shall we build them? :)


PS: Adding a document to the index with incremental indexing usually
takes exactly the same time as with normal indexing. However, there are
special cases (especially when a document has to be removed from the
index). But overall the time taken by the incremental indexer is much,
much less.
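
To illustrate why the hashes have to be kept between runs, here is a
minimal in-memory sketch in Python. It is not the actual Perlfect Search
indexer (which is Perl and keeps its hashes on disk); the class
IncrementalIndex and its methods are hypothetical names. The point is
that removing or re-indexing a document needs a stored record of which
terms that document contributed.

```python
import re
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that supports adding and removing documents."""

    def __init__(self):
        # term -> set of ids of documents containing the term
        self.postings = defaultdict(set)
        # doc id -> set of terms; this is the extra data that must be
        # kept between indexing operations so documents can be removed
        self.doc_terms = {}

    @staticmethod
    def _terms(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        """Index a new document, or re-index one that has changed."""
        if doc_id in self.doc_terms:
            self.remove(doc_id)  # drop the stale entries first
        terms = self._terms(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Remove a document using its stored term set."""
        for term in self.doc_terms.pop(doc_id, ()):
            docs = self.postings[term]
            docs.discard(doc_id)
            if not docs:
                del self.postings[term]

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = self._terms(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result
```

Adding a document touches only that document's terms, which is why it
costs about the same as in a full run; removal is the special case
because it first has to look up the stored term set.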

PS2: I am pretty sure that hashes created with DB_File can be accessed
with libdb. Have a look at