Perlfect Solutions
 

[Perlfect-search] Re: File formats

Mark Morgan Lloyd markMLl@telemetry.co.uk
Sun, 01 Oct 2000 22:06:52 +0000
Here's a plain diff of the changes I've made so far to the indexer
script. It compares indexer.pl as released with my changed version,
indexer1.pl. Note the first, trivial change, which has indexer1 get the
remainder of its configuration from conf1.pl. I'm including the diff
inline; if anybody wants it as an attachment, please email me at the
reply-to address in the header or as in my sig below.

The two changes have the effect of (a) modifying the file regex to
ignore files such as .htaccess and filename.html~ whilst potentially
allowing names with no extension, such as are used by INN to store
discussion-group messages, and (b) modifying the treatment of a hyphen
embedded in a number, so that the elements of 'phone numbers are indexed
separately rather than run together as a single word (I'm sure this can
be cleaned up).
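
The file-name test in (a) can be tried on plain strings. What follows is a minimal sketch, not part of indexer.pl: the extension list and the helper name wanted_name are assumptions made for illustration, and the -f test is left out so that no real files are needed.

```perl
use strict;
use warnings;

# Hypothetical extension list; the real one comes from @EXT in the
# Perlfect configuration file.
my @EXT = ('html', 'htm', 'txt');
my $extensions_regex = join '|', @EXT;

# Hypothetical helper: the same pattern as in the patch, minus the -f
# test, so that it works on names rather than on files.
sub wanted_name {
    my ($name) = @_;
    return $name =~ /^(?!\.)([^.]*(?<!~)|.*\.(?:$extensions_regex))$/ ? 1 : 0;
}

for my $name ('index.html', '.htaccess', 'filename.html~',
              '1234', 'notes~', 'image.gif') {
    printf "%-16s %s\n", $name, wanted_name($name) ? 'indexed' : 'skipped';
}
```

With this list, index.html and the extensionless 1234 are accepted; the other four names are rejected.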

This probably completes what I need in indexer for the time being, but
I intend to look at the search script next, so that when a local URL is
put into a results page to indicate a file it may optionally be prefixed
by (e.g.) ~, letting Apache or whatever munge the actual reference
internally. As a particular example, I've got a link "newsstore" from my
Apache root to the INN root. It's obviously desirable for Apache to be
able to indicate which client is to be used to view matched files, and a
fairly easy way of doing this is to force a 404 which calls a script to
rewrite the URL from http: to news: where appropriate. It sounds
gruesome, but I've checked it in the past and it works.
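
A rough sketch of such a 404 handler is below. It is an illustration only, not the script I have used: the helper name news_url_for is hypothetical, it assumes the articles sit in a traditional spool layout (group/name/components/article-number) under /newsstore/, and it relies on Apache setting REDIRECT_URL for an ErrorDocument script.

```perl
use strict;
use warnings;

# Hypothetical helper: map a local path under /newsstore/ to a news: URL.
# Assumes a traditional spool layout, e.g. /newsstore/comp/lang/perl/123.
sub news_url_for {
    my ($path) = @_;
    return unless defined $path && $path =~ m{^/newsstore/(.+)$};
    my @parts = grep { length } split m{/}, $1;
    # Drop a trailing article number, keeping only the group name.
    pop @parts if @parts && $parts[-1] =~ /^\d+$/;
    return unless @parts;
    return 'news:' . join('.', @parts);
}

# When run as an ErrorDocument handler, Apache supplies the path that
# caused the 404 in REDIRECT_URL.
if (defined(my $path = $ENV{REDIRECT_URL})) {
    my $target = news_url_for($path);
    if (defined $target) {
        print "Status: 302 Found\r\nLocation: $target\r\n\r\n";
    } else {
        print "Status: 404 Not Found\r\nContent-Type: text/plain\r\n\r\n";
        print "Not found\n";
    }
}
```

This sketch only redirects to the group; pointing at an individual article would need the message-id, which it does not look up.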

-- 
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or
colleagues]


-----8<-----
45c45
< require 'conf.pl';
---
> require 'conf1.pl';             # MarkMLl
77a78
> my $extensions_regex = join '|', @EXT;          # MarkMLl
132c133,138
<   my @files = grep {-f and /^.+\.(.+)$/ and grep {/^\Q$1\E$/} @EXT} @contents;
---
> #  my @files = grep {-f and /^.+\.(.+)$/ and grep {/^\Q$1\E$/} @EXT} @contents;
>   my @files = grep {-f and /^(?!\.)([^.]*(?<!~)|.*\.(?:$extensions_regex))$/o} @contents;
>
> # Above change to allow blank extension and exclude files starting with .
> # or (where there is no extension) ending with ~. MarkMLl, courtesy of
> # Hugo van der Sanden (hv@crypt0.demon.co.uk).
234c240,243
<   $buffer =~ s/-(\s*\n\s*)?//g;  # join parts of hyphenated words
---
> #  $buffer =~ s/-(\s*\n\s*)?//g;  # join parts of hyphenated words
>
> # The line above discards all hyphens. However, if we want to be able
> # to index telephone numbers it's worth deferring this. MarkMLl.
250a260,263
> # Block below modified such that, if we want to index numbers (particularly
> # telephone numbers), - is converted to a space if there are three or more digits
> # on one side and one or more on the other, else it is discarded.
>
251a265,266
>     $buffer =~ s/(?<=\d)-(?:\s*\n\s*)?(?=\d{3})|(?<=\d{3})-(?:\s*\n\s*)?(?=\d)/ /g; # MarkMLl
>     $buffer =~ s/-(\s*\n\s*)?//g;                   # MarkMLl
253a269
>     $buffer =~ s/-(\s*\n\s*)?//g;  # join parts of hyphenated words MarkMLl
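
The hyphen handling in the last three hunks can be exercised on its own. This is a sketch under stated assumptions rather than the indexer's code: the helper name split_number_hyphens is hypothetical, and the first substitution is a lookaround-based variant, chosen so that the digits on either side of the hyphen are only tested and never consumed or replaced.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions.
sub split_number_hyphens {
    my ($buffer) = @_;
    # A hyphen with a digit on one side and three or more digits on the
    # other becomes a space, so each element of the number is a word.
    $buffer =~ s/(?<=\d)-(?:\s*\n\s*)?(?=\d{3})|(?<=\d{3})-(?:\s*\n\s*)?(?=\d)/ /g;
    # Any remaining hyphen is discarded, joining the parts of a
    # hyphenated word (including one split across a line break).
    $buffer =~ s/-(\s*\n\s*)?//g;
    return $buffer;
}

for my $text ("0123-456-7890", "555-1234", "hyphen-\nated", "01-02") {
    (my $shown = $text) =~ s/\n/\\n/g;
    printf "%-16s => %s\n", $shown, split_number_hyphens($text);
}
```

Here 555-1234 becomes "555 1234", whereas 01-02 has fewer than three digits on both sides and so collapses to 0102.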