[Perlfect-search] Problems with indexing
Roger Growe roger_growe at earthlink.net
Mon Dec 6 21:18:50 GMT 2004
I'd really appreciate a hand with this script.
After working through the instructions, all the correspondence in the mailing list archives, and other sources, I am still getting this when I start the indexer:
Using DB_File...
Checking for old temp files...
Building string of special characters...
Loading 'no index' regular expressions:
- frontpage2.html
- frontpage.html
[etc.]
Loading stopwords...371 stopwords loaded.
Starting crawler...
Note: I will not visit more than $HTTP_MAX_PAGES=150 pages.
Loading http://www.quinacrine.com/robots.txt...
Error: Couldn't get 'http://www.quinacrine.com/robots.txt': response code 500
Not using any robots.txt.
Error: Couldn't get 'http://www.quinacrine.com/index.html': response code 500
Crawler finished: indexed 0 files, 0 terms (0 different terms).
Ignored 0 files because of conf/no_index.txt
Ignored 0 files because of robots.txt
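A "response code 500" in a log like this is ambiguous: it can be a genuine server error, or it can be a response that the HTTP client library made up locally because it never reached the server (DNS failure, refused connection, timeout). The sketch below is one way to tell the two apart. It assumes the indexer fetches pages with LWP::UserAgent, which is not stated above, and relies on LWP marking its own internally generated responses with a "Client-Warning: Internal response" header. The helper name describe_response is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: label a status line according to whether LWP
# generated the response itself or received it from the server.
sub describe_response {
    my ($status_line, $client_warning) = @_;
    my $origin =
        (defined $client_warning && $client_warning eq 'Internal response')
        ? 'generated locally by LWP'
        : 'returned by the server';
    return "$status_line ($origin)";
}

# Fetch every URL given on the command line and report where the
# status came from. LWP is loaded only when there is something to fetch.
if (@ARGV) {
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 15);
    for my $url (@ARGV) {
        my $res = $ua->get($url);
        print "$url: ",
            describe_response($res->status_line,
                              scalar $res->header('Client-Warning')),
            "\n";
    }
}
```

Run from a shell on the same server as the indexer, passing the URLs from the log as arguments, for example http://www.quinacrine.com/robots.txt. A locally generated 500 would point at name resolution or outbound connections from the host rather than at the search configuration.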
I thought it might be the structure of the site that was the problem. The pages are not in the root but in the 'web' directory, like so:
root
  .config
  .sessions
  cgi-bin
  logs
  web
In cgi-bin I have these:
searchsite
  conf
  data
  Perlfect
  temp
  templates
I installed manually by necessity and need to index through HTTP for the same reason. As far as I can tell, the syntax, permissions, and other rules all check out. It is a Unix server.
The main sections of config.pl look like this now:
$DOCUMENT_ROOT = 'http://www.quinacrine.com/';
# The base url of your site (normally that's the URL which
# corresponds to $DOCUMENT_ROOT).
$BASE_URL = 'http://www.quinacrine.com';
# The url in which Perlfect Search is located (usually somewhere in cgi-bin/).
$CGIBIN = "/cgi-bin/searchsite/";
# The full-path of the directory where Perlfect Search is installed.
$INSTALL_DIR = '/nfs/cust/5/80/46/564085/cgi-bin/searchsite/';
# Only files with these extensions should be indexed (case-sensitive).
# This is only relevant for file system indexing, when you index files via
# http you need to set @HTTP_CONTENT_TYPES instead. [re-index]
@EXT = ("html", "htm", "shtml", "txt");
[Password section]
###########################################################################
### http configuration
### You only need this if you want to index your pages via http
# Where you want the indexer to start via http. Leave empty if
# you want to index the files in the filesystem ($DOCUMENT_ROOT).
# ** WARNING **: Do not use for foreign servers! It might use too many
# resources on other people's servers. [re-index]
# example: $HTTP_START_URL = 'http://localhost/';
$HTTP_START_URL = 'http://www.quinacrine.com/index.html';
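The comments in config.pl describe $DOCUMENT_ROOT as the filesystem location of the documents and $BASE_URL as the URL that corresponds to it, whereas the settings above put a URL in both. For comparison, here is a sketch of the filesystem form. The path is an assumption, inferred from $INSTALL_DIR and the directory listing earlier in this message, and whether this setting has any bearing on the 500 responses is not established.

```perl
# Hypothetical value: the path below is inferred from $INSTALL_DIR and
# the 'web' directory shown in the listing; it has not been confirmed.
$DOCUMENT_ROOT = '/nfs/cust/5/80/46/564085/web/';    # filesystem path, not a URL

# Unchanged from the settings above.
$BASE_URL = 'http://www.quinacrine.com';
$HTTP_START_URL = 'http://www.quinacrine.com/index.html';
```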
Thinking that the file structure could be the issue, I put a copy of robots.txt in the root, but I still get the 500 response. I've also tried leaving $HTTP_START_URL blank and setting it to 'http://www.quinacrine.com/', along with everything else I could think of to break this jam.
Thanks for your help in advance,
Roger Growe