[Perlfect-search] Dynamic Page (PHP pages) Indexing Problems
Michael Borck perlfect-search@perlfect.com
Wed, 25 Feb 2004 23:33:46 +0800
Hi All,
I had Perlfect Search working, then moved the site from predominantly
.shtml pages to .php pages. Because they are PHP pages and I don't
want all the "code" to be indexed (highlighted in matches, etc.), I need
to set $HTTP_START_URL. Now I am having three problems, potentially
related, with indexing:
1. Command line indexing causes a permissions error
2. Indexing via index_form.html avoids the permissions error but
indexes zero (0) files
3. Moving $START_URL further up the tree indexes HTML files but not
PHP
Can anyone help?
The long-winded version of the problems follows. Apologies for the
length, but I am at a loss what to do.
I have tried to index via the following methods:
1. From the command line: /usr/bin/perl indexer.pl
2. Using index_form.html (action modified to suit the local Perlfect
install)
If I index from the command line and then search, I get the following
error, regardless of whether $START_URL is set (causing it to retrieve
dynamic pages) or unset (causing it to index via the filesystem):
Software error:
Cannot open
/usr/data/slash/Curtin/units/ec100/cgi_bin/perlfect/search/data/
inv_index: Permission denied at
/usr/apache+ssl/htdocs/www.computing.edu.au/units/ec100/cgi_bin/
perlfect/search/search.cgi line 76.
If I index via index_form.html, I do not get that error. Instead I get
the following output:
Loading https://www.computing.edu.au/robots.txt...
Error: Couldn't get 'https://www.computing.edu.au/robots.txt': response
code 501
[Wed Feb 25 22:47:18 2004] indexer.cgi: Not using any robots.txt.
Warning: cannot remove '/usr/data/slash/Curtin/units/ec100/' from
'https://www.computing.edu.au/units/ec100/'
[Wed Feb 25 22:47:18 2004] indexer.cgi: Use of uninitialized value in
pattern match (m//) at ./indexer.cgi line 568.
[Wed Feb 25 22:47:18 2004] indexer.cgi: Use of uninitialized value in
concatenation (.) or string at ./indexer.cgi line 569.
Error: Couldn't get 'https://www.computing.edu.au/units/ec100/':
response code 500
Crawler finished: indexed 0 files, 0 terms (0 different terms).
etc....
I can now use search_form.html without an error, although it obviously
finds nothing. I have not changed conf.pl, only the method of indexing.
When I look in $PERLFECT_HOME/data, all files are mode 755 with
owner/group nobody.
I tried the command line again and changed the permissions and owner to
match, but still got "Software error:....". Strange. Oh well, indexing
via CGI, although less secure, looks more promising.
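The ownership and permission check described above can be scripted. This is only a rough sketch: it assumes the web server runs as a different user (such as nobody) from the account used on the command line, so the index files must be readable by "others". The helper name others_can_read is hypothetical, and the DATA_DIR default is simply copied from the error message.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "others" (e.g. the web server user)
# have read permission on the given file or directory.
others_can_read() {
    # Character 8 of the `ls -ld` mode string (e.g. "-rwxr-xr-x") is the
    # read bit for "others"; a missing file yields an empty string.
    bit=$(ls -ld -- "$1" 2>/dev/null | cut -c8)
    [ "$bit" = "r" ]
}

# Assumed location, taken from the error message; adjust to your install.
DATA_DIR=${DATA_DIR:-/usr/data/slash/Curtin/units/ec100/cgi_bin/perlfect/search/data}

if [ -d "$DATA_DIR" ]; then
    for f in "$DATA_DIR" "$DATA_DIR"/*; do
        [ -e "$f" ] || continue            # skip an unmatched glob
        others_can_read "$f" || echo "not readable by others: $f"
    done
else
    echo "no such directory: $DATA_DIR"
fi
```

Any file it reports after a command-line indexing run is a candidate for the "Permission denied" error. The check only looks at the "other" read bit, not at group membership or directory search permission.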
Next, "indexed 0 files" and "response code 500". I suspect this is me
confusing $DOCUMENT_ROOT, $BASE_URL and $START_URL. I called the techos
to have a look at the server logs, and they mentioned something about
robots.txt (I had sort of guessed that). I asked whether they had
anything installed to block spiders/crawlers etc.; they do not.
So I commented out $ROBOT_AGENT (ignoring the comment about "really
know what you are doing") and got the following error:
Warning: cannot remove '/usr/data/slash/Curtin/units/ec100/' from
'https://www.computing.edu.au/units/ec100/'
[Wed Feb 25 23:03:47 2004] indexer.cgi: Use of uninitialized value in
pattern match (m//) at ./indexer.cgi line 568.
[Wed Feb 25 23:03:47 2004] indexer.cgi: Use of uninitialized value in
concatenation (.) or string at ./indexer.cgi line 569.
Error: Couldn't get 'https://www.computing.edu.au/units/ec100/':
response code 501
Okay, let's have a play around with $DOCUMENT_ROOT, $BASE_URL and
$START_URL. Now I look after all web pages under the main site. Our
home page is:
https://www.computing.edu.au/
I maintain
https://www.computing.edu.au/units/ec100/
I only want to index pages in my area. After trying various
combinations, the only one that will index pages is setting $START_URL
to:
http://www.computing.edu.au/
The values of the other variables do not seem to affect the indexer.
Also notice that the protocol has changed from https:// to http://.
This still only indexes HTML files (regardless of the settings in
conf.pl). I have tried making my start page an HTML file (rather than a
PHP one), still no luck.
Getting rather desperate now, I suspect it might be a missing library
(response code 501: not implemented). Checking back on the Perlfect
website and downloading a useful little utility, wpm.pl (Where is Perl
Module), it looks like LWP (libwww-perl) is not installed. Did I
mention that I do not have root access? So I installed the module in
$PERLFECT_HOME and added the following lines to search.cgi, indexer.cgi
and indexer_web.cgi (I also modified the files to require .cgi and not
.pl files).
use lib 'lib/perl5/site_perl/5.8.1';
print "\@INC is @INC\n";
The path looks okay but still no luck indexing dynamic pages.
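One way to test the missing-library suspicion from the shell, as a rough sketch: ask perl, with the private library directory on its search path, whether LWP loads and which URL schemes it can handle. The function name lwp_supports and the PERLFECT_HOME default are assumptions made for illustration; is_protocol_supported is an LWP::UserAgent method. An absolute -I path is used because a relative "use lib" path is resolved against whatever directory the script is started from. LWP itself reports a 501 when it has no handler for a URL scheme, so "https: not supported" here would be consistent with the errors above, though that is a possibility to rule out rather than a diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if LWP (found via the given library directory
# or the system @INC) has a handler for the URL scheme, non-zero otherwise.
# A missing perl or a missing LWP also counts as "not supported".
lwp_supports() {
    scheme=$1
    libdir=${2:-.}
    perl -I"$libdir" -MLWP::UserAgent -e '
        exit(LWP::UserAgent->new->is_protocol_supported($ARGV[0]) ? 0 : 1);
    ' "$scheme" 2>/dev/null
}

# Assumed layout: the private module install described above.
PERLFECT_HOME=${PERLFECT_HOME:-$PWD}
LIBDIR=$PERLFECT_HOME/lib/perl5/site_perl/5.8.1

for scheme in http https; do
    if lwp_supports "$scheme" "$LIBDIR"; then
        echo "$scheme: supported"
    else
        echo "$scheme: not supported (or LWP not found)"
    fi
done
```

Run it with PERLFECT_HOME set to the install directory. If http is supported but https is not, that would match the observation that only the http:// start URL gets indexed.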
Can anyone help?
Michael.
--