Perlfect Solutions

[Perlfect-search] FIX: file size reporting for PDFs

Cameron Moore
Thu, 25 Oct 2001 00:11:18 -0500
I noticed the other day that the file size of PDFs is being reported
incorrectly.  Around line 221 of the CVS version, you have:

  $sizes_db{$doc_id} = length(${$buffer});

Well, that saves the size of the ASCII text we get back from pdftotext,
not the actual file size.  To fix this I made the following change:

  # remember document size
  if (! $HTTP_START_URL && $url =~ /\.pdf$/i) {
    $sizes_db{$doc_id} = -s $url;
  } else {
    $sizes_db{$doc_id} = length(${$buffer});
  }

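To see the difference between the two sizes, here is a minimal standalone sketch. The helper name document_size and its $is_web flag are made up for illustration (they stand in for the $HTTP_START_URL check above), and a temp file filled with filler bytes plays the part of a PDF:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of the indexer): pick the size to record.
#   $path   - path of the document
#   $buffer - reference to the text extracted from it
#   $is_web - true when indexing over HTTP, where there is no local file
sub document_size {
    my ($path, $buffer, $is_web) = @_;
    if (!$is_web && $path =~ /\.pdf$/i) {
        my $size = -s $path;        # on-disk size in bytes
        return $size if $size;      # fall through if missing or empty
    }
    return length(${$buffer});      # length of the extracted text
}

# Stand-in for a PDF: 1000 bytes on disk, but only 11 characters of text.
my ($fh, $pdf) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
binmode $fh;
print {$fh} 'x' x 1000;
close $fh;

my $text = 'hello world';
print document_size($pdf, \$text, 0), "\n";   # prints 1000 (file size)
print document_size($pdf, \$text, 1), "\n";   # prints 11 (text length)
```

The second call shows what the unpatched code records for every PDF: the length of the text, not the size of the file.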
I haven't attempted to make this work for the web indexer.  I *think*
something like this might work (watch your step: I jump around a bit):

  # change parse_pdf to accept the doc_id
  parse_pdf($file, $doc_id, \$buffer);
  # in parse_pdf(), save the size of the temp file
  $sizes_db{$doc_id} = -s $tmpfile;
  # and back at line 221,
  # save the size as long as the size isn't already saved
  $sizes_db{$doc_id} = length(${$buffer}) unless $sizes_db{$doc_id};

This should take care of both indexing methods In One Easy Step(tm).
Looks like this will require some changes to the web indexer in order to
get the doc_id before it's sent to parse_pdf...but I don't use the web
indexer, so I'll let someone else mess with that.  ;-)
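
For anyone who wants to try it, the idea can be sketched standalone as follows. This is untested against the real indexer: %sizes_db is a plain hash here, parse_pdf fakes the pdftotext step, and remember_size is a made-up name for the size assignment at line 221.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my %sizes_db;   # stand-in for the real size database

# Hypothetical parse_pdf that also takes the doc_id: it records the
# temp file's size, then fills the buffer. Text extraction is faked.
sub parse_pdf {
    my ($tmpfile, $doc_id, $buffer) = @_;
    $sizes_db{$doc_id} = -s $tmpfile;
    ${$buffer} = 'extracted text';
    return;
}

# The shared size step: keep a size that parse_pdf already saved.
sub remember_size {
    my ($doc_id, $buffer) = @_;
    $sizes_db{$doc_id} = length(${$buffer}) unless $sizes_db{$doc_id};
    return;
}

# Document 1: a fake 2048-byte PDF sitting in a temp file.
my ($fh, $tmpfile) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} 'x' x 2048;
close $fh;

my $pdf_text;
parse_pdf($tmpfile, 1, \$pdf_text);
remember_size(1, \$pdf_text);

# Document 2: plain HTML that never goes through parse_pdf.
my $html_text = '<p>hi</p>';
remember_size(2, \$html_text);

print "$sizes_db{1} $sizes_db{2}\n";   # prints "2048 9"
```

One thing to watch: if the size database keeps entries between runs, the "unless" check would also keep a stale size for a reused doc_id, so the entry may need clearing before a document is re-indexed.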
Cameron Moore