[Perlfect-search] FIX: file size reporting for PDFs
Cameron Moore lists@toad.bitstreet.net
Thu, 25 Oct 2001 00:11:18 -0500
Hello,
I noticed the other day that the file size of PDFs is being reported
incorrectly. In indexer.pl (around line 221 of the CVS version), you
have:
$sizes_db{$doc_id} = length(${$buffer});
Well, that saves the size of the ASCII text we get back from pdftotext,
not the actual file size. To fix this I made the following change:
# remember document size
if (! $HTTP_START_URL && $url =~ /\.pdf$/i) {
    $sizes_db{$doc_id} = -s $url;
} else {
    $sizes_db{$doc_id} = length(${$buffer});
}
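To see why the two numbers differ, here is a small standalone sketch. It does not touch indexer.pl: the temp file and the sample text are made-up stand-ins for a real PDF and for the pdftotext output, so treat it as an illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up stand-in for a PDF: 1000 bytes on disk.
my ($fh, $path) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
binmode $fh;
print {$fh} 'x' x 1000;
close $fh or die "close failed: $!";

# Made-up stand-in for the text pdftotext would hand back.
my $text   = 'extracted text';
my $buffer = \$text;

my $text_size = length ${$buffer};   # what the original line stored
my $file_size = -s $path;            # what the fix stores for PDFs

print "text: $text_size bytes, file: $file_size bytes\n";
```

The -s file test returns the size on disk in bytes, so it is unaffected by how much text the converter managed to extract.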
I haven't attempted to make this work for the web indexer. I *think*
something like this might work (watch your step: I jump around a bit
here):
# change parse_pdf to accept the doc_id
parse_pdf($file, $doc_id, \$buffer);
...
# in parse_pdf(), save the size of the temp file
$sizes_db{$doc_id} = -s $tmpfile;
...
# and back to line 221 of indexer.pl
# save the size as long as the size isn't already saved
$sizes_db{$doc_id} = length(${$buffer}) unless $sizes_db{$doc_id};
This should take care of both indexing methods In One Easy Step(tm).
Looks like this will require some changes to indexer_web.pl in order to
get the doc_id before it's sent to parse_pdf...but I don't use the web
indexer, so I'll let someone else mess with that. ;-)
--
Cameron Moore