Perlfect Solutions

[Perlfect-search] data dump script (and xml)

Sun, 6 May 2001 23:37:16 +0100
>On Friday 27 April 2001 01:12, you wrote:
>>  the way perlfect works is really very good for indexing xml files in
>>   a simple way: all you have to do is change the parsing for title and
>>   description and body to the relevant xml fields and you get a nice,
>>   if limited, low-overhead free-text search of your xml data. no expat,
>>   no nasty anything. very impressed. is that something that should be
>>   developed? might be able to help, if so.
>That sounds very useful. It looks like all we need is two configuration
>options "TITLE" and "BODY" (maybe "DESCRIPTION", too), which default to
>"title" and "body". If you have a patch already, please send it. If not I
>will add this to my version 3.21 TODO list.

i don't have a patch but i'd be happy to (learn how to) make one.

it's an interesting possibility, i agree: the strength of your 
indexing would combine well with the added structure of xml, and it's 
very simple to teach the indexer to read in data according to xml 
rather than html conventions. the best, and most distinctive, thing 
about it for me is that it allows html and xml (and pdf) to sit side 
by side in the same index and come under the same weighting and 
searching system without ever really caring about the data format.
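as a rough illustration of how small that change is (everything here - %conf, the sub name, the default tag names - is invented for the sketch, not taken from the real script), the reading could amount to something like:

```perl
#!/usr/bin/perl -w
use strict;

# %conf stands in for the real configuration; TITLE and BODY are the
# two proposed options, defaulting to 'title' and 'body'
my %conf = (TITLE => 'title', BODY => 'body');

# naive, non-validating read: take the first occurrence of each
# configured element, much as the html parsing does now
sub extract_xml_fields {
    my ($content) = @_;
    my %fields;
    for my $key (qw(TITLE BODY)) {
        my $tag = $conf{$key};
        ($fields{lc $key}) =
            $content =~ m!<\Q$tag\E[^>]*>(.*?)</\Q$tag\E>!is;
    }
    return %fields;
}

my %f = extract_xml_fields(
    '<doc><title>Report</title><body>Some text.</body></doc>');
print "$f{title}\n";   # Report
```

a real version would read the TITLE/BODY options from the config file rather than a hard-coded hash, but the extraction itself is that small.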

there are some issues to consider, though:

* this isn't very good xml practice. You can confidently default to 
<title> for the title field, but a valid xml file could easily 
contain several <title> elements - one for the main document and 
several for its constituent parts - and you can't assume that the 
first one is the main one. The best thing would be to write a 
separate file - - and use the equivalent of your 
(clever) db_file arrangement to use an xml parser if it's there and 
just read text if not. parsing xml is hard, and element order is 
not significant apart from nesting.

easier, but less satisfactory: you could just tell people that 
you're going to read the first title field, so they'd better sort 
out the field order themselves. some would frown.
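the parser-if-present idea could look something like this (a sketch only: XML::Parser is just the obvious candidate module, and get_title is an invented name, not part of the real script):

```perl
#!/usr/bin/perl -w
use strict;

# by analogy with the db_file arrangement: prefer a real xml parser
# when one is installed, fall back to a plain text read when not
my $have_xml_parser = eval { require XML::Parser; 1 } ? 1 : 0;

sub get_title {
    my ($xml) = @_;
    if ($have_xml_parser) {
        my ($title, $want) = ('', 0);
        my $p = XML::Parser->new(Handlers => {
            Start => sub { $want = 1 if $_[1] eq 'title' && $title eq '' },
            End   => sub { $want = 0 if $_[1] eq 'title' },
            Char  => sub { $title .= $_[1] if $want },
        });
        $p->parse($xml);
        return $title;
    }
    # fallback: dumb text read, first <title> wins
    my ($t) = $xml =~ m!<title[^>]*>(.*?)</title>!is;
    return defined $t ? $t : '';
}

print get_title('<doc><title>A</title><title>B</title></doc>'), "\n";  # A
```

even the parser path here still takes the first title, of course - picking the *right* one is the policy question, not a coding one.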

* xml files are rarely browsed directly, because only explorer 5+ can 
do it and nobody understands xsl, least of all me. What often 
happens, i believe, is that the xml file is read by a script and 
delivered to the user by way of a template (which may or may not be 
xsl). Which means that you need some url translation rules in the 
system to map indexer paths to public urls.

which is ok: it's only a couple of lines of code to say 'if xml, do 
this, otherwise record the url like so'. the problem is that you'd 
probably have to stop using arbitrary id numbers, or at least change 
the order of data writes so that if an id number is found in the 
document then that is used, and otherwise you fall back on something 
arbitrary but less collision-prone than a bare number. in many cases 
i expect that the id or other significant field you'd need to 
retrieve the xml file from the delivery system is contained in the 
file itself, not the filename.

but again, you could just rule this out and say 'we only deal with 
file paths: if you've got cgi going on then you integrate it'.

and i could easily be very wrong about this part.
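the translation rule itself really is tiny; a sketch, with the delivery script name and the <id> element both invented for illustration:

```perl
#!/usr/bin/perl -w
use strict;

# map an indexer path to a public url: non-xml files keep their path,
# xml files get routed through a (hypothetical) delivery script,
# preferring an id carried inside the file itself if there is one
sub public_url {
    my ($path, $content) = @_;
    return $path unless $path =~ /\.xml$/i;
    my ($id) = $content =~ m!<id>\s*(.*?)\s*</id>!is;
    $id = $path unless defined $id && $id ne '';
    return "/cgi-bin/show.cgi?id=$id";
}

print public_url('docs/report.xml', '<doc><id>r-42</id></doc>'), "\n";
# -> /cgi-bin/show.cgi?id=r-42
print public_url('docs/page.html', '<html></html>'), "\n";
# -> docs/page.html
```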

* there's no reason to confine yourself to title and description. It 
makes sense with html, since that's all you can reliably retrieve 
from the file, but the whole point of xml (for me, at least) is to 
store structured data (on low-tech systems). I've added a few more 
indexes to store document type, author and other salient fields so 
that those can be displayed on the results pages, and searched 
against, of course.

So I'd suggest that you parameterise the data sets. The core data 
would remain the same, of course (inv_index, addresses, titles and 
descriptions), but you could allow people to add whatever fields 
they want to record, return and keep there. It would actually make 
the code simpler, too, to keep a hash of data files (in the form 
databasename => 'fieldname', i guess) and loop over it when you're 
tying data files, renaming them, parsing pages or feeding data to 
the template.
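keyed like that, indexing a document becomes one loop over the hash, the same for core and user-added fields alike. a sketch with invented names (plain hashes standing in here for what would really be tied db_file databases):

```perl
#!/usr/bin/perl -w
use strict;

# one hash in the form databasename => 'fieldname'; entries invented
my %data_files = (
    titles       => 'title',
    descriptions => 'description',
    authors      => 'author',    # a user-added field
    doctypes     => 'type',      # a user-added field
);

# plain hashes stand in for the tied db_files of the real script
my %db = map { $_ => {} } keys %data_files;

# one loop covers every configured field, core and user-defined alike
sub index_document {
    my ($doc_id, $content) = @_;
    while (my ($name, $field) = each %data_files) {
        my ($value) =
            $content =~ m!<\Q$field\E[^>]*>(.*?)</\Q$field\E>!is;
        $db{$name}{$doc_id} = $value if defined $value;
    }
}

index_document(1, '<doc><title>Report</title><author>jo</author></doc>');
print "$db{titles}{1} by $db{authors}{1}\n";   # Report by jo
```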

the template complicates things, though: applying the rules you use 
to format data before templating would be much uglier if you were 
looping over a hash than it is when you specify each field in turn. 
It would be quite easy to keep the rules you've got at the moment 
and key them to specific field names - if ($fieldname eq 'date') 
{ format appropriately before passing } and so on - but it would be 
very hard to extend that capability to user-defined fields.
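the field-name keying could be a dispatch table rather than an if/elsif chain, with a pass-through default so unknown user fields at least display unformatted. again, all names invented, and the date rule assumes raw epoch seconds:

```perl
#!/usr/bin/perl -w
use strict;

# formatting rules keyed by field name
my %format = (
    date  => sub {
        my ($raw) = @_;                        # assumed: epoch seconds
        my @t = localtime($raw);
        return sprintf '%04d-%02d-%02d', $t[5] + 1900, $t[4] + 1, $t[3];
    },
    title => sub {
        my ($raw) = @_;
        return length $raw > 40 ? substr($raw, 0, 37) . '...' : $raw;
    },
);

# unknown (user-defined) fields fall through untouched
sub format_field {
    my ($name, $raw) = @_;
    my $rule = $format{$name};
    return $rule ? $rule->($raw) : $raw;
}

print format_field('author', 'jo bloggs'), "\n";   # jo bloggs, untouched
```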

this needn't just apply to xml, but the benefit would be much less 
immediate with plain html.

you could even let people dictate the regex that they want to use to 
retrieve data for each field, and the code they want to run it 
through before it is displayed, but it would be a great shame to 
lose the approachability of the current script, and having slots 
where people can write perl would probably do that.
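if it stopped at regexes - no slots for arbitrary perl - it might stay approachable; a sketch of what the configuration shape could be (invented, of course):

```perl
#!/usr/bin/perl -w
use strict;

# hypothetical configuration shape: one user-supplied compiled
# pattern per field, with the first capture taken as the value
my %field_regex = (
    title  => qr!<title[^>]*>(.*?)</title>!is,
    author => qr!<author>\s*(.*?)\s*</author>!is,
);

sub extract {
    my ($field, $content) = @_;
    my $re = $field_regex{$field} or return undef;
    my ($value) = $content =~ $re;
    return $value;
}

print extract('author', '<doc><author> jo </author></doc>'), "\n";   # jo
```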

* all of which added complexity would need to be reflected in the 
setup script, although an early 'read xml too' toggle would keep it 
from the eyes of the uninterested.

so you might end up with quite broad changes to the basic structures 
of the search if you did the xml thing properly. Some of them might 
be useful anyway, i suppose, and I think it would be a very valuable 
product. xml is still black magic to most people: a friendly freeware 
indexer would be well received, especially if it did things in the 
proper way (xml evangelists can be _very_ pious).

I would be interested in helping to put this together, if that would 
be useful, but i'll sympathise if you decide that it's too much of a 
diversion from your main goal.

it would also be possible to use your indexing, spidering and 
retrieval code but create a separate, related xml product. Some 
thought would have to go into what was shared and what was separate, 
and it would be a shame to lose the ability to search through files 
of different types at the same time and within the same relevance 
weighting.
i hope all this makes sense: it's all written in a bit of a rush.