[Perlfect-search] Problems with indexing

Roger Growe roger_growe at earthlink.net
Mon Dec 6 21:18:50 GMT 2004

 

I'd really appreciate a hand with this script.

After working through the instructions, all the correspondence in the mailing list archives, and other sources, I am still getting this when starting the indexer:



 Using DB_File...
Checking for old temp files...
Building string of special characters...
Loading 'no index' regular expressions:
    - frontpage2.html
    - frontpage.html
[etc.]
 
Loading stopwords...371 stopwords loaded.
Starting crawler...
Note: I will not visit more than $HTTP_MAX_PAGES=150 pages.
Loading http://www.quinacrine.com/robots.txt...
Error: Couldn't get 'http://www.quinacrine.com/robots.txt': response code 500
Not using any robots.txt.
Error: Couldn't get 'http://www.quinacrine.com/index.html': response code 500
 
Crawler finished: indexed 0 files, 0 terms (0 different terms).
Ignored 0 files because of conf/no_index.txt
Ignored 0 files because of robots.txt
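 
To see what that 500 actually is, I figure the response can be fetched directly with LWP (which, as far as I can tell, is what the crawler itself uses). A minimal sketch, assuming LWP::UserAgent is installed; the browser-like agent string is just to rule out user-agent blocking:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch robots.txt the same way a crawler would, and dump the full
# response so the 500 can be inspected outside of Perlfect's indexer.
my $ua  = LWP::UserAgent->new(agent => 'Mozilla/5.0 (debug)');
my $res = $ua->get('http://www.quinacrine.com/robots.txt');
print $res->status_line, "\n\n";
print $res->headers_as_string, "\n";
print $res->content, "\n";

That would at least show whether the 500 comes with a server error page that names the real problem.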
 
I thought it might be the structure of the site that was the problem.  The pages are not in the root but in the 'web' directory, like so:

 

root
        .config
        .sessions
        cgi-bin
        logs
        web
 
In cgi-bin I have these:
 
searchsite
        conf
        data
        Perlfect
        temp
        templates
 

I installed manually by necessity, and for the same reason I need to index via HTTP. All syntax, permissions, and other rules that I can find check out. It's a Unix server.
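
A quick sketch of how the permissions could be double-checked, in case I missed something (the assumption that the indexer writes to data and temp is mine, based on the directories above and the "Checking for old temp files..." line in the output):

#!/usr/bin/perl
use strict;
use warnings;

# Writability check for the directories I assume the indexer writes
# to. The base path reflects my install dir shown below.
my $base = '/nfs/cust/5/80/46/564085/cgi-bin/searchsite';
for my $dir ("$base/data", "$base/temp") {
    print "$dir: ",
          (-d $dir ? (-w $dir ? "writable" : "NOT writable") : "missing"),
          "\n";
}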

 

The main sections of config.pl look like this now:

 

 

$DOCUMENT_ROOT = 'http://www.quinacrine.com/';

# The base url of your site (normally that's the URL which
# corresponds to $DOCUMENT_ROOT).
$BASE_URL = 'http://www.quinacrine.com';

# The url in which Perlfect Search is located (usually somewhere in cgi-bin/).
$CGIBIN = "/cgi-bin/searchsite/";

# The full-path of the directory where Perlfect Search is installed.
$INSTALL_DIR = '/nfs/cust/5/80/46/564085/cgi-bin/searchsite/';

# Only files with these extensions should be indexed (case-sensitive).
# This is only relevant for file system indexing, when you index files via
# http you need to set @HTTP_CONTENT_TYPES instead. [re-index]
@EXT = ("html", "htm", "shtml", "txt");

[Password section]

###########################################################################
### http configuration
### You only need this if you want to index your pages via http

# Where you want the indexer to start via http. Leave empty if
# you want to index the files in the filesystem ($DOCUMENT_ROOT).
# ** WARNING **: Do not use for foreign servers! It might use too many
# resources on other people's servers. [re-index]
# example: $HTTP_START_URL = 'http://localhost/';
$HTTP_START_URL = 'http://www.quinacrine.com/index.html';
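
Rereading the comment on @EXT, I gather that since I am indexing via HTTP, @HTTP_CONTENT_TYPES needs to be set as well. A minimal sketch of what I assume that would look like (the content-type values are my guess at typical values for HTML/text pages, not copied from my file):

# Assumed http-indexing settings; the content types listed below
# are my guess, per the @EXT comment above.
$HTTP_START_URL     = 'http://www.quinacrine.com/index.html';
@HTTP_CONTENT_TYPES = ('text/html', 'text/plain');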

 

Thinking that the file structure could be the issue, I put a copy of robots.txt in the root; still the 500 response. I've left $HTTP_START_URL blank, used 'http://www.quinacrine.com/', and tried anything else I could think of to break this jam.

 

Thanks in advance for your help,

 

Roger Growe