[Perlfect-search] FIX: file size reporting for PDFs

Cameron Moore lists@toad.bitstreet.net
Thu, 25 Oct 2001 00:11:18 -0500

Hello,

I noticed the other day that the file size of PDFs is being reported
incorrectly.  In indexer.pl (around line 221 of the CVS version), you
have:

  $sizes_db{$doc_id} = length(${$buffer});

Well, that saves the length of the ASCII text we get back from pdftotext,
not the size of the PDF file itself.  To fix this I made the following
change:

  # remember document size
  if (! $HTTP_START_URL && $url =~ /\.pdf$/i) {
    $sizes_db{$doc_id} = -s $url;
  } else {
    $sizes_db{$doc_id} = length(${$buffer});
  }
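
For illustration, here is a small standalone sketch of the difference
between the two measurements.  It does not touch indexer.pl; the temp
file, its contents, and the sample text are made up, and the string in
$text merely stands in for whatever pdftotext would return.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for a PDF on disk: 2048 bytes of filler (hypothetical data).
my ($fh, $path) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
binmode $fh;
print {$fh} "\0" x 2048;
close $fh or die "close failed: $!";

# Stand-in for the text pdftotext would hand back (hypothetical).
my $text   = "Hello from page one.\n";
my $buffer = \$text;

my $text_len  = length(${$buffer});  # what the original line stores
my $file_size = -s $path;            # size of the file on disk, in bytes

print "text length: $text_len, file size: $file_size\n";
```

The two numbers are unrelated: one counts characters of extracted text,
the other counts bytes in the file, which is why the PDF branch above
uses -s.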

I haven't attempted to make this work for the web indexer.  I *think*
something like this might work (watch your step: I jump around a bit
here):

  # change parse_pdf to accept the doc_id
  parse_pdf($file, $doc_id, \$buffer);
...
  # in parse_pdf(), save the size of the temp file
  $sizes_db{$doc_id} = -s $tmpfile;
...
  # and back to line 221 of indexer.pl
  # save the size as long as the size isn't already saved
  $sizes_db{$doc_id} = length(${$buffer}) unless $sizes_db{$doc_id};

This should take care of both indexing methods In One Easy Step(tm).
Looks like this will require some changes to indexer_web.pl in order to
get the doc_id before it's sent to parse_pdf...but I don't use the web
indexer, so I'll let someone else mess with that.  ;-)
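
A rough, untested-against-the-real-indexer sketch of that idea follows.
Everything in it is an assumption for illustration: %sizes_db is a plain
hash rather than the indexer's real database hash, parse_pdf is a
stripped-down stand-in that does not call pdftotext, and remember_size
is a hypothetical helper wrapping the "line 221" statement.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my %sizes_db;   # plain hash standing in for the indexer's size database

# Hypothetical parse_pdf: saves the on-disk size, then fills the buffer.
sub parse_pdf {
    my ($file, $doc_id, $buffer) = @_;
    $sizes_db{$doc_id} = -s $file;
    ${$buffer} = "extracted text";   # placeholder for pdftotext output
    return;
}

# Save the buffer length only if no size was stored earlier.
sub remember_size {
    my ($doc_id, $buffer) = @_;
    $sizes_db{$doc_id} = length(${$buffer}) unless $sizes_db{$doc_id};
    return;
}

# A 5000-byte temp file plays the role of the downloaded PDF.
my ($fh, $tmpfile) = tempfile(UNLINK => 1);
print {$fh} 'x' x 5000;
close $fh or die "close failed: $!";

my $pdf_text;
parse_pdf($tmpfile, 1, \$pdf_text);
remember_size(1, \$pdf_text);        # keeps the size saved by parse_pdf

my $html_text = '<p>plain page</p>';
remember_size(2, \$html_text);       # no saved size, so uses text length

print "doc 1: $sizes_db{1}, doc 2: $sizes_db{2}\n";
```

One thing to watch: the "unless" test treats a stored size of 0 the same
as no stored size, so an empty file would fall back to the text length.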
-- 
Cameron Moore