Perlfect Solutions

[Perlfect-search] ranking long vs. short documents

Mon, 14 Aug 2000 10:50:40 +0300
hmmm, what you propose is interesting but think of this:

i have a paragraph called A and a paragraph called B, both in separate
documents. suppose that a search for X returns document A. then i make an
archive of my pages, so i create a new document C with 2 paragraphs, A
and B. Now if I search for X, isn't document C exactly as important with
respect to that term (X) as document A?

another funny situation: i have document A as above. i copy and paste it
10 times into document D. don't A and D contain exactly the same
information?
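
to make the A, C and D cases concrete, here is a rough sketch that scores them both ways. everything numeric in it is made up for illustration: the collection statistics, the term counts and the document lengths are hypothetical, $size is assumed to be the document length in words, and the position component of the real score is left out.

```perl
use strict;
use warnings;

# Hypothetical collection statistics (not from the thread).
my $DN = 100;   # total number of documents in the index
my $df = 10;    # number of documents containing term X

# Weight without any length factor: term count times log(DN/df).
# The position component of the real score is ignored here.
sub weight_plain {
    my ($tdf) = @_;
    return $tdf * log($DN / $df);
}

# Weight with the proposed length factor.
# $size is assumed to be the document length in words.
sub weight_length {
    my ($tdf, $size) = @_;
    my $faktor = $tdf / $size + 0.5;
    return $tdf * $faktor * log($DN / $df);
}

# Hypothetical documents: [occurrences of X, length in words]
my %docs = (
    A => [2, 100],     # the original paragraph
    C => [2, 200],     # A plus B, where B does not mention X
    D => [20, 1000],   # A pasted ten times
);

for my $name (sort keys %docs) {
    my ($tdf, $size) = @{ $docs{$name} };
    printf "%s  plain=%.3f  with_length=%.3f\n",
        $name, weight_plain($tdf), weight_length($tdf, $size);
}
```

with these numbers A and C get the same plain weight, and C drops slightly once the length factor is applied. D scores ten times A under both formulas, because pasting a document ten times leaves the ratio of occurrences to length unchanged.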

finally, a situation which occurs quite often where the longer document
may be better: imagine a document I. I is an index containing a list of
links to other documents. in these other documents the terms that are
linked are explained in further detail. such a document could easily
be much larger than the document I. common examples: dictionary, list of
products, table of contents, etc.

the bottom line is that although a shorter document may be more concise
with respect to the keyword searched for, it may in the end contain
exactly the same information, since we are assuming that both the long
and the short document have exactly the same number of occurrences of the
keyword.

in my opinion, the easiest solution to this problem is not to tamper
with the weight calculation but instead simply to display the document size
next to each result, so the user can pick the one they prefer.
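
a minimal sketch of what that could look like, assuming each result is a hash with url, score and bytes fields. those field names, the sample values and the format_size helper are hypothetical and not taken from the Perlfect Search code.

```perl
use strict;
use warnings;

# Hypothetical search results; the field names are made up for this sketch.
my @results = (
    { url => 'a.html', score => 4.61,  bytes => 812  },
    { url => 'c.html', score => 4.61,  bytes => 1650 },
    { url => 'd.html', score => 46.05, bytes => 8120 },
);

# Format a byte count as a short human-readable size.
sub format_size {
    my ($bytes) = @_;
    return sprintf('%.1fk', $bytes / 1024) if $bytes >= 1024;
    return "${bytes}b";
}

# Sort by score (highest first); the score itself is not adjusted
# for length, the size is only shown next to each hit.
for my $r (sort { $b->{score} <=> $a->{score} } @results) {
    printf "%-10s score %6.2f  (%s)\n",
        $r->{url}, $r->{score}, format_size($r->{bytes});
}
```

the ranking stays as it is, and the reader sees at a glance which hit is the short paragraph and which is the long archive page.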

Daniel Naber wrote:
> Hi,
> currently the score of a match is influenced only by the position and the
> number of occurrences of a term. Shouldn't the length of the document also
> play a role? If a word occurs twice in a short document, isn't that more
> relevant than twice in a very long document?
> Something like this:
>       $faktor = ($tdf{$doc_id}/$size+0.5);
>       $weight = $tdf{$doc_id} * $faktor * log($DN / $df);
> (0.5 is just some trial'n'error value)
> Regards
>  Daniel