[Perlfect-search] Dynamic Page (PHP pages) Indexing Problems

Michael Borck perlfect-search@perlfect.com
Wed, 25 Feb 2004 23:33:46 +0800
Hi All,

I had Perlfect Search working, then moved the site from predominantly
.shtml pages to .php pages.  Because they are PHP pages and I don't really
want all the "code" to be indexed (highlight matches etc.), I need to set
$HTTP_START_URL.  I am now having three problems, potentially related,
with indexing:

1. Command line indexing causes a permissions error
2. index_form.html solves the permissions problem but indexes zero (0) files
3. Moving $START_URL further up the tree indexes HTML files but not PHP

Can anyone help?





The long-winded version of the problems follows.  Apologies for the
length, but I am at a loss what to do.

I have tried to index via the following methods
   Index from the command line:  /usr/bin/perl indexer.pl
   Using index_form.html  (action modified to suit local perlfect  
install)

If I use the command line to index, when I search I get the following
error, regardless of whether I set $START_URL (causing it to retrieve
dynamic pages) or leave it unset (causing it to index via the filesystem):

Software error:
Cannot open
/usr/data/slash/Curtin/units/ec100/cgi_bin/perlfect/search/data/inv_index:
Permission denied at
/usr/apache+ssl/htdocs/www.computing.edu.au/units/ec100/cgi_bin/perlfect/search/search.cgi
line 76.
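
A "Permission denied" on inv_index usually comes down to which user wrote the index files and which user the web server runs search.cgi as. The sketch below is one way to inspect that; it is not part of Perlfect Search. `describe_access` is a hypothetical helper, and the script name in the usage line is an assumption. It reports the mode, owner and group of each path, and whether the current effective user can read it.

```perl
use strict;
use warnings;

# Fall back to the numeric id when no name can be looked up.
sub name_or_id {
    my ($name, $id) = @_;
    return defined $name ? $name : $id;
}

# Hypothetical helper: summarize ownership and readability of one path
# from the point of view of the user running this script.
sub describe_access {
    my ($path) = @_;
    my @st = stat($path) or return { path => $path, exists => 0 };
    return {
        path     => $path,
        exists   => 1,
        mode     => sprintf("%04o", $st[2] & 07777),   # permission bits only
        owner    => name_or_id(scalar(getpwuid($st[4])), $st[4]),
        group    => name_or_id(scalar(getgrgid($st[5])), $st[5]),
        readable => (-r $path ? 1 : 0),                # checked for the effective uid
        run_as   => name_or_id(scalar(getpwuid($>)), $>),
    };
}

# Usage (script name is an assumption): perl check_access.pl data data/inv_index
for my $path (@ARGV) {
    my $info = describe_access($path);
    if (!$info->{exists}) {
        print "$path: cannot stat ($!)\n";
        next;
    }
    print "$path: mode $info->{mode}, owner $info->{owner}, group $info->{group}, ",
          ($info->{readable} ? "readable" : "NOT readable"),
          " by $info->{run_as}\n";
}
```

Running it once from the shell and once through the web server would show whether the two users see the files differently. Every directory on the path also needs search (x) permission for the reading user.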



If I index via index_form.html I do not get the error.  I get the
following output:

Loading https://www.computing.edu.au/robots.txt...
Error: Couldn't get 'https://www.computing.edu.au/robots.txt': response code 501
[Wed Feb 25 22:47:18 2004] indexer.cgi: Not using any robots.txt.
Warning: cannot remove '/usr/data/slash/Curtin/units/ec100/' from 'https://www.computing.edu.au/units/ec100/'
[Wed Feb 25 22:47:18 2004] indexer.cgi: Use of uninitialized value in pattern match (m//) at ./indexer.cgi line 568.
[Wed Feb 25 22:47:18 2004] indexer.cgi: Use of uninitialized value in concatenation (.) or string at ./indexer.cgi line 569.
Error: Couldn't get 'https://www.computing.edu.au/units/ec100/': response code 500

Crawler finished: indexed 0 files, 0 terms (0 different terms).
etc....

I can now use search_form.html without the error, although obviously it
finds nothing.  I have not changed conf.pl, just the method of indexing.
When I look in $PERLFECT_HOME/data I see that all files are mode 755 with
owner/group nobody.

I try the command line again and change the permissions and owner to
match, but still get "Software error:....".  Strange.  Oh well, indexing
via CGI, although less secure, looks more promising.


Now for the "indexed 0 files" and "response code 500".  I suspect this is
me confusing $DOCUMENT_ROOT, $BASE_URL and $START_URL.  I called the
techos to have a look at the server logs and they mentioned something
about robots.txt (I had sort of guessed that).  I asked if they had
anything installed to block spiders/crawlers etc.; they had nothing.

So I comment out $ROBOT_AGENT (ignoring the comment and its "really
know what you are doing" warning) and get the following error:

Warning: cannot remove '/usr/data/slash/Curtin/units/ec100/' from 'https://www.computing.edu.au/units/ec100/'
[Wed Feb 25 23:03:47 2004] indexer.cgi: Use of uninitialized value in pattern match (m//) at ./indexer.cgi line 568.
[Wed Feb 25 23:03:47 2004] indexer.cgi: Use of uninitialized value in concatenation (.) or string at ./indexer.cgi line 569.
Error: Couldn't get 'https://www.computing.edu.au/units/ec100/': response code 501

Okay, let's have a play around with $DOCUMENT_ROOT, $BASE_URL and
$START_URL.  Now I look after all web pages under the main site.  Our
home page is:

https://www.computing.edu.au/

I maintain

https://www.computing.edu.au/units/ec100/

I only want to index pages in my area.  After trying various
combinations, the only one that will index pages is setting $START_URL to:

http://www.computing.edu.au/

The values of the other variables do not seem to affect the indexer.
Also notice that the protocol has changed from https:// to http://.
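
An https:// start URL failing with 501 while an http:// one works is consistent with the HTTP client library lacking HTTPS support, though other causes are possible. LWP (libwww-perl), which the indexer appears to rely on for fetching pages, generates a 501 response itself when asked for a URL scheme it has no handler for. A minimal check, assuming LWP::UserAgent can be loaded at all; `scheme_status` is a hypothetical helper:

```perl
use strict;
use warnings;

# Returns a short status line for one URL scheme.  LWP is loaded inside
# eval so that a missing module is reported instead of being fatal.
sub scheme_status {
    my ($scheme) = @_;
    my $ua = eval { require LWP::UserAgent; LWP::UserAgent->new };
    return "LWP::UserAgent not available" unless $ua;
    return $ua->is_protocol_supported($scheme)
        ? "$scheme: supported"
        : "$scheme: not supported";
}

print scheme_status($_), "\n" for qw(http https);
```

If https is reported as not supported, an SSL module such as Crypt::SSLeay or IO::Socket::SSL (and, on newer LWP releases, LWP::Protocol::https) is typically what is missing.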

This still only indexes HTML files (regardless of the settings in
conf.pl).  I have tried making my start page an HTML file (rather than a
PHP one), still no luck.

Getting rather desperate now, I suspect it might be a missing library
(response code 501, "Not Implemented").  Checking back on the Perlfect
website and downloading a useful little utility, wpm.pl (Where is Perl
Module), it looks like LWP (libwww-perl) is not installed.  Did I mention
that I do not have root access?  So I install the module in
$PERLFECT_HOME and add the following lines to search.cgi, indexer.cgi
and indexer_web.cgi (I also modified the files to require .cgi rather
than .pl files).

use lib 'lib/perl5/site_perl/5.8.1';
print "\@INC is @INC\n";

The path looks okay but still no luck indexing dynamic pages.
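
One thing worth checking: a relative path given to `use lib` is resolved against the current working directory, which may not be the script's directory when the web server runs the CGI. A common idiom, sketched here under the assumption that the modules live in lib/perl5/site_perl/5.8.1 beneath the script's own directory, anchors the path with the core FindBin module:

```perl
use strict;
use warnings;
use FindBin qw($Bin);                      # $Bin is the directory containing this script
use lib "$Bin/lib/perl5/site_perl/5.8.1";  # absolute path, independent of the working directory

# Confirm that the absolute path really ended up in @INC.
my $wanted = "$Bin/lib/perl5/site_perl/5.8.1";
my $found  = grep { $_ eq $wanted } @INC;
print "Looking for $wanted in \@INC: ", ($found ? "present" : "missing"), "\n";
```

This only fixes where Perl looks for modules. Whether the installed LWP can actually fetch https:// URLs is a separate question.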

Can anyone help?

Michael.
--