[Perlfect-search] data dump script (and xml)
will will@spanner.org
Sun, 6 May 2001 23:37:16 +0100
>On Friday 27 April 2001 01:12, you wrote:
>
>> the way perlfect works is really very good for indexing xml files in
>> a simple way: all you have to do is change the parsing for title and
>> description and body to the relevant xml fields and you get a nice,
>> if limited, low-overhead free-text search of your xml data. no expat,
>> no nasty anything. very impressed. is that something that should be
>> developed? might be able to help, if so.
>
>That sounds very useful. It looks like all we need is two configuration
>options "TITLE" and "BODY" (maybe "DESCRIPTION", too), which default to
>"title" and "body". If you have a patch already, please send it. If not I
>will add this to my version 3.21 TODO list.

i don't have a patch but i'd be happy to (learn how to) make one.
it's an interesting possibility, i agree: the strength of your
indexing would combine well with the added structure of xml, and it's
very simple to teach the indexer to read in data according to xml
rather than html conventions. the best, and most distinctive, thing
about it for me is that it allows html and xml (and pdf) to sit side
by side in the same index and come under the same weighting and
searching system without ever really caring about the data format.
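to make the 'change the parsing' idea concrete, here is a rough sketch of
tag-driven extraction. It is in python rather than the indexer's perl, purely
for illustration, and every name in it (FIELD_TAGS, extract_field,
extract_fields) is made up for the example rather than taken from the
perlfect code. It is the crude no-parser approach: a regex per field, first
match wins.

```python
import re

# Hypothetical configuration: which tag supplies each indexed field.
# The defaults mirror the html case ("title" and "body").
FIELD_TAGS = {"title": "title", "body": "body"}


def extract_field(text, tag):
    """Return the text inside the first <tag>...</tag> pair, or '' if absent."""
    name = re.escape(tag)
    pattern = r"<%s\b[^>]*>(.*?)</%s\s*>" % (name, name)
    # IGNORECASE is a concession to sloppy html; real xml is case-sensitive.
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    if not match:
        return ""
    # Replace nested markup with spaces, then collapse runs of whitespace.
    inner = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(inner.split())


def extract_fields(text, field_tags=FIELD_TAGS):
    """Build a {field: text} record using the configured tag for each field."""
    return {field: extract_field(text, tag) for field, tag in field_tags.items()}
```

an html page would go through extract_fields(page) unchanged, and an xml file
with different element names through something like
extract_fields(doc, {"title": "headline", "body": "content"}).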
there are some issues to consider, though:
* this isn't very good xml practice. You can confidently default to
<title> for the title field, but a valid xml file could easily
contain several <title> fields - one for the main document and
several for its constituent parts - and you can't assume that the
first one is the main one. The best thing would be to write another
separate file - indexer_xml.pl - and use the equivalent of your
(clever) db_file arrangement to use an xml parser if it's there and
just read text if not. parsing xml is hard, and field order is not
significant apart from nesting.
easier but less satisfactory: you could just tell people that you're
going to read the first title field so they'd better sort out the
field order themselves. some would frown.
* xml files are rarely browsed directly, because only explorer 5+ can
do it and nobody understands xsl, least of all me. What often
happens, i believe, is that the xml file is read by a script and
delivered to the user by way of a template (which may or may not be
xsl). Which means that you need some url translation rules in the
system to map indexer paths to public urls.
which is ok: it's only a couple of lines of code to say 'if xml, do
this, otherwise record url like so'. the problem is that you'd
probably have to stop using arbitrary id numbers, or at least change
the order of data writes so that if an id number is found in the
document then that is used, otherwise you use something arbitrary but
more distinctive than just a number. in many cases i expect the id or
other significant field that you'd have to use to retrieve the xml
file from the delivery system is contained in the file itself, not
the filename.
but again, you could just rule this out and say 'we only deal with
file paths: if you've got cgi going on then you integrate it'.
and i could easily be very wrong about this part.
* there's no reason to confine yourself to title and description. It
makes sense with html, since that's all you can reliably retrieve
from the file, but the whole point of xml (for me, at least) is to
store structured data (on low-tech systems). I've added a few more
indexes to store document type, author and other salient fields so
that those can be displayed on the results pages, and searched
against, of course.
So I'd suggest that you parameterise the data sets. The core data
would remain the same, of course (inv_index, addresses, titles and
descriptions), but you could allow people to add whatever fields they
want to record and return, and what they want to keep there. It would
actually make the code simpler, too, to keep a hash of data files (in
the form databasename => 'fieldname', i guess) in conf.pl, and loop
over it when you're tying data files, renaming them, parsing pages or
feeding data to the Template.pm.
the template complicates things, though: applying the rules you use
to format data before templating would be much uglier if you were
looping on a hash rather than specifying each time. It would be quite
easy to keep the rules you've got at the moment and key them to
specific field names - if ($fieldname eq 'date') { format
appropriately before passing} and so on - but it would be very hard
to extend that capability to user-defined fields.
this needn't just apply to xml, but the benefit would be much less
immediate with plain html.
you could even let people dictate the regex that they want to use to
retrieve data for each field, and the code they want to run it
through before it is displayed, but it would be a great shame to lose
the approachability of the current script, and having slots where
people can write perl would probably do that.
* all of which added complexity would need to be reflected in the
setup script, although an early 'read xml too' toggle would keep it
from the eyes of the uninterested.
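a rough sketch of the parameterised data sets suggested in the third point,
again in python rather than perl and with invented names throughout: FIELDS
stands in for the hash in conf.pl (databasename => 'fieldname'), and
FORMATTERS for display rules keyed to specific field names. The date rule is
only a placeholder, not what the current script does.

```python
import re

# Hypothetical field table: data file name -> tag it is filled from.
FIELDS = {
    "titles": "title",
    "descriptions": "description",
    "authors": "author",  # user-added field
    "dates": "date",      # user-added field
}

# Display rules keyed to field names; fields without a rule pass through.
FORMATTERS = {
    "date": lambda value: value.replace("-", "/"),  # placeholder rule
}


def parse_document(text, fields=FIELDS):
    """Loop over the field table and pull one value per data file."""
    record = {}
    for db_name, tag in fields.items():
        name = re.escape(tag)
        match = re.search(r"<%s>(.*?)</%s>" % (name, name), text, re.DOTALL)
        record[db_name] = match.group(1).strip() if match else ""
    return record


def format_for_template(record, fields=FIELDS, formatters=FORMATTERS):
    """Apply any per-field rule, then key the result by field name."""
    out = {}
    for db_name, tag in fields.items():
        rule = formatters.get(tag, lambda value: value)
        out[tag] = rule(record.get(db_name, ""))
    return out
```

adding a field is then one line in FIELDS. The awkward part is the one
already mentioned: user-defined fields only get formatting if someone writes
a rule for them.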
so you might end up with quite broad changes to the basic structures
of the search if you did the xml thing properly. Some of them might
be useful anyway, i suppose, and I think it would be a very valuable
product. xml is still black magic to most people: a friendly freeware
indexer would be well received, especially if it did things in the
proper way (xml evangelists can be _very_ pious).
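for the title problem in the first point, 'the proper way' might look
something like this sketch (python rather than perl, names invented). It
assumes that the least deeply nested <title> is the main one, which is a
guess about typical documents rather than a rule of xml, and it ignores
namespaces. Python always ships a parser, so instead of 'use a parser if
it's there' the fallback here is triggered by input that is not well-formed.

```python
import re
import xml.etree.ElementTree as ET
from collections import deque


def main_title(text, tag="title"):
    """Return the shallowest <tag> in the document, or '' if there is none."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed xml: fall back to the first textual match.
        name = re.escape(tag)
        match = re.search(r"<%s>(.*?)</%s>" % (name, name), text, re.DOTALL)
        return match.group(1).strip() if match else ""
    # Breadth-first walk, so nesting depth decides rather than field order.
    queue = deque([root])
    while queue:
        element = queue.popleft()
        if element.tag == tag:
            return (element.text or "").strip()
        queue.extend(element)  # iterating an Element yields its children
    return ""
```

given a document whose first <title> belongs to a section and whose second
belongs to the document itself, this picks the second, where a first-match
regex would pick the section title.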
I would be interested in helping to put this together, if that would
be useful, but i'll sympathise if you decide that it's too much of a
diversion from your main goal.
it would also be possible to use your indexing, spidering and
retrieval code but create a separate, related xml product. Some
thought would have to go into what was shared and what was separate,
and it would be a shame to lose the ability to search through files
of different types at the same time and within the same relevance
calculation.
i hope all this makes sense: it's all written in a bit of a rush.
wdyt?
best
will
--
pgpkey: http://www.spanner.org/keys/will.txt