Perlfect Solutions
 

[Perlfect-search] ranking long vs. short documents

giorgos giorgos@perlfect.com
Fri, 18 Aug 2000 20:19:29 +0300
Deniz Sarikaya wrote:
> 
> giorgos wrote:
> >
> > the bottom line is that: although a shorter document may be more concise
> > with respect to the keyword searched for, it may in the end contain
> > exactly the same information, since we are assuming that both the long
> > and the short document have exactly the same number of occurrences of the
> > keyword(s).
> >
> > in my opinion, the easiest solution to this problem is not to tamper
> > with the weight calculation but instead to simply display the document size
> > next to each result, and then the user may pick the one they prefer.
> 
> In my opinion, the bottom line is that context is king. My ideal
> solution would be to quote the lines in which the keyword(s) appear
> along with each search hit. I do not know how much of a performance/size
> hit this would entail, nor how much extra programming it would entail. I
> guess we'd store line numbers along with documents in the hashes?
> 
> Maybe it's an option which could be enabled in the conf.pl, and turned
> off by default.
> 
> Also, is there any way we could enable the user specifying boolean AND
> and OR? I know the engine defaults to OR, but shouldn't the person be
> able to stick an AND in to manually override it? I know we can achieve
> the same effect using +, so AND shouldn't be that hard to stick in.
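The AND-to-'+' rewrite suggested in the quoted message could be sketched roughly like this (a hypothetical pre-parse step; the sub name is made up and is not part of Perlfect's actual query parser):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical pre-parse step: turn "cats AND dogs" into "+cats +dogs",
# reusing the existing '+' (required term) syntax. Bare OR is already
# the engine's default, so OR tokens are simply dropped.
sub rewrite_boolean {
    my ($query) = @_;
    return $query unless $query =~ /\bAND\b/;
    my @terms = grep { $_ ne 'AND' and $_ ne 'OR' } split /\s+/, $query;
    # Prefix each plain term with '+'; terms already marked +/- stay as-is.
    return join ' ', map { /^[+-]/ ? $_ : "+$_" } @terms;
}
```

So `rewrite_boolean('cats AND dogs')` yields `'+cats +dogs'`, while a query with no explicit AND passes through untouched.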

Well, it depends on how you handle the situation... There are two ways:

1. Build a huge index, as daniel suggested, that includes all the text
from each document. This would effectively duplicate the content of your
site. If your site is only a few hundred documents, no problem! Using this
index, when a query is executed, mark the appropriate context by
retrieving it from the database (file). This would slow down the indexer
quite a bit, and the search engine a little bit.
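A minimal sketch of approach 1, assuming the indexer keeps each document's full text keyed by a document id (the hash and sub names here are hypothetical, not Perlfect's actual data structures; a real version would tie the hash to a DBM file on disk):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical content store: at index time we keep the full text of
# each document, keyed by document id. In practice this would be a
# DBM file on disk rather than an in-memory hash.
my %content;

sub index_document {
    my ($doc_id, $text) = @_;
    $content{$doc_id} = $text;
}

# At query time, pull up to $radius characters on either side of the
# first keyword match, straight from the stored text -- no need to
# reopen the original file.
sub context_snippet {
    my ($doc_id, $keyword, $radius) = @_;
    my $text = $content{$doc_id};
    return '' unless defined $text;
    if ($text =~ /(.{0,$radius}\b\Q$keyword\E\b.{0,$radius})/is) {
        return $1;
    }
    return '';
}
```

For example, after `index_document(1, $page_text)` a result page could call `context_snippet(1, 'weight', 30)` to show the hit in context.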

2. After a query is executed, open the files that are in the results and
retrieve their content; then mark the context. This would effectively
mean opening about 10 files and reading them in every query executed. It
would have no effect on the indexer but would slow down the search
engine part quite a bit.
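A sketch of approach 2, assuming the results are plain files readable from disk (the sub name is illustrative, not Perlfect's API):

```perl
#!/usr/bin/perl -w
use strict;

# Approach 2: no extra index data. After the query runs, open each
# result file and collect the lines that contain any of the keywords.
sub context_lines {
    my ($file, @keywords) = @_;
    my $pattern = join '|', map { quotemeta } @keywords;
    my @context;
    open my $fh, '<', $file or return ();
    while (my $line = <$fh>) {
        chomp $line;
        push @context, $line if $line =~ /\b(?:$pattern)\b/i;
    }
    close $fh;
    return @context;
}
```

With roughly 10 result files per query, this adds about 10 open/read/close cycles per search, which is exactly the search-side slowdown described above.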

If I were to build this feature I would implement both approaches and
let the user pick. It seems to me that each of the two approaches above
corresponds to completely different needs...

Anyone have a better proposal on a way to refine our approach above?
Perhaps there is some way to summarize the content of the HTML files and
index only that... I know there is a Perl module that does exactly
that (generates summaries of HTML files) and, if I am not mistaken, it is
called HTML::Summary (wow, i wouldn't have guessed).
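If I remember the module correctly, usage would look something like the following (a sketch assuming HTML::Summary and HTML::TreeBuilder are installed from CPAN; the LENGTH value is arbitrary):

```perl
#!/usr/bin/perl -w
use strict;
use HTML::Summary;
use HTML::TreeBuilder;

# Summarize a page at index time and index only the summary, instead
# of the full text. The sample HTML here is just filler.
my $html = '<html><head><title>test</title></head><body><p>'
         . ('Perlfect Search indexes pages and ranks them by keyword weight. ' x 10)
         . '</p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my $summarizer = HTML::Summary->new( LENGTH => 100 );
my $summary = $summarizer->generate($tree);
print $summary, "\n";

$tree->delete;
```

The summary string is what would go into the index, keeping the size down at the cost of some recall.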

Let me know what you think...

giorgos.