Perlfect Solutions

[Perlfect-search] BUG: get_* and non-HTML files

Cameron Moore
Thu, 25 Oct 2001 00:28:23 -0500
I'm sure this is no secret to the developers, but I've noticed that
most of the indexing code makes some blind assumptions that break down
when indexing PDFs, TXTs, and other non-HTML files.

Exhibit A:

In get_cleaned_body(), you have the following:

  my ($cleaned) = (${$buffer} =~ m/<BODY.*?>(.*)<\/BODY>/is);
  $cleaned = ${$buffer} if( ! $cleaned );       # PDF files don't have a <body>

How can you assume that a PDF file will not have the text matched by
that regex?  You can't.  If the text of a PDF or TXT file happens to
contain "<BODY>...</BODY>", everything outside the match is silently
dropped.  The only way I know to deal with this is to reorganize the
code in a few places and treat HTML files differently from other files,
i.e., don't call any of the get_* subs on a simple text buffer.
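
To make the concern concrete, here is a minimal sketch.  It has not been
run against the actual indexer.  The regex is the one quoted above; the
sub names body_by_regex() and body_by_type(), the $is_html flag, and the
sample text are all made up for illustration and are not part of the
Perlfect Search code.

```perl
use strict;
use warnings;

# The regex from get_cleaned_body(), wrapped in a hypothetical helper.
sub body_by_regex {
    my ($buffer) = @_;
    my ($cleaned) = (${$buffer} =~ m/<BODY.*?>(.*)<\/BODY>/is);
    $cleaned = ${$buffer} if !$cleaned;    # fall back to the whole buffer
    return $cleaned;
}

# Hypothetical alternative: the caller says whether the buffer is HTML,
# and the regex is applied only in that case.
sub body_by_type {
    my ($buffer, $is_html) = @_;
    return ${$buffer} unless $is_html;     # plain text: leave it untouched
    return body_by_regex($buffer);
}

# Text as it might come out of a PDF or TXT file that talks about HTML.
my $text = "Intro. Wrap content in <BODY>like this</BODY> tags. Outro.";

print body_by_regex(\$text), "\n";      # prints "like this"
print body_by_type(\$text, 0), "\n";    # prints the full sentence
```

How the caller would decide $is_html (file extension, MIME type, or
whichever converter produced the buffer) is left open here; that depends
on how the indexer already tracks file types.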

Note that I haven't tested this bug.  I've only looked through the code.
I didn't see any mention of this issue in the TODO list of the CHANGES
file, so I thought I'd bring it up.  Thanks
Cameron Moore