Perlfect Solutions

[Perlfect-search] BUG: get_* and non-HTML files

Cameron Moore
Fri, 26 Oct 2001 20:12:09 -0500
* [2001.10.26 14:07]:
> On Thursday 25 October 2001 07:28, you wrote:
> > In get_cleaned_body(), you have the following:
> >
> >   my ($cleaned) = (${$buffer} =~ m/<BODY.*?>(.*)<\/BODY>/is);
> >   $cleaned = ${$buffer} if( ! $cleaned );       # PDF files don't have a <body>
> >
> > How can you assume that a PDF file will not have the text matched by
> > that regex?  You can't.
> Can you try this patch? I cannot test it, as I cannot produce any PDFs 
> like that (any of my PDFs don't work with pdftotext).

I'll test it out as soon as I get anon access to the CVS server.  :-)
After a quick look, I think your changes will work fine.  The only
change I would make is to reverse the logic of isPDF().  IMO, it would
be better to check if isHTML.  The isPDF will be outgrown if you want to
start parsing .DOC files or other non-HTML files.  I'd do something
like this:

  # File extensions with HTML markup that should be ignored
  @HTML_EXT = ("htm","html","shtml");

  # Regular Expression that we only want to build once
  $HTML_EXT_REGEX = '('.join('|',@HTML_EXT).')';

  sub isHTML {
    return ( $_[0] =~ m/\.$HTML_EXT_REGEX$/io );
  }

Then reverse the logic wherever isPDF was used.  Make sure to use m//o,
so we only build the pattern once.  That would be a longer-lasting
solution, I think.  :-)
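
To make the reversed logic concrete, here is a rough, self-contained sketch.
The get_cleaned_body() below is only a simplified stand-in that I am assuming
takes a filename and a buffer reference; the real function in the indexer may
have a different signature and do more work.  The quotemeta() call and the
non-capturing group are my own additions, not part of the snippet above.

```perl
use strict;
use warnings;

# File extensions whose contents are HTML markup
my @HTML_EXT = ('htm', 'html', 'shtml');

# Alternation built once from the extension list; quotemeta() guards
# against regex metacharacters in an extension (an added precaution)
my $HTML_EXT_REGEX = '(?:' . join('|', map { quotemeta($_) } @HTML_EXT) . ')';

sub isHTML {
    my ($filename) = @_;
    # /o compiles the interpolated pattern only on the first call
    return ( $filename =~ m/\.$HTML_EXT_REGEX$/io ) ? 1 : 0;
}

# Hypothetical, simplified stand-in for the real get_cleaned_body():
# the signature (filename, buffer reference) is an assumption.
sub get_cleaned_body {
    my ($filename, $buffer) = @_;

    # Non-HTML files (PDF, DOC, ...) are returned untouched, even if
    # their text happens to contain something that looks like <body>
    return ${$buffer} unless isHTML($filename);

    my ($cleaned) = ( ${$buffer} =~ m/<BODY.*?>(.*)<\/BODY>/is );
    return defined $cleaned ? $cleaned : ${$buffer};
}

my $html = '<html><body>Hello</body></html>';
my $pdf  = 'Converted text that mentions <body>markup</body> literally';

print get_cleaned_body('index.html', \$html), "\n";   # body contents only
print get_cleaned_body('manual.pdf', \$pdf),  "\n";   # whole text, unchanged
```

The point is that the decision is made from the file type rather than from
whether the regex happens to match, so a PDF whose text contains a <body> tag
is no longer truncated.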
Cameron Moore