Perlfect Search configuration file
$rcs = ' $Id: conf.shtml,v 1.7 2003/02/24 23:08:26 daniel Exp $ ' ; |
| |
NOTE: Whenever you change one of the options that's marked with [re-index]
you need to run indexer.pl again to make the change take effect. |
| |
basic configuration
You'll have to adapt these values if you didn't use setup.pl |
| |
| Where do you want the indexer to start on your disk?
** Note ** : If your files are generated dynamically (e.g. via PHP)
you should set $HTTP_START_URL (see below), otherwise users
will be able to see your pages' source code using the
"highlight matches" link.
[re-index] |
| $DOCUMENT_ROOT = '/home/perlfect/perlfect.com/html/'; |
| |
| The base url of your site (normally that's the URL which
corresponds to $DOCUMENT_ROOT). |
| $BASE_URL = 'http://localhost'; |
| |
| The url in which Perlfect Search is located (usually somewhere in cgi-bin/). |
| $CGIBIN = "/cgi-bin/search/"; |
| |
| The full-path of the directory where Perlfect Search is installed. |
| $INSTALL_DIR = '/home/perlfect/perlfect.com/cgi-bin/search/'; |
| |
| Only files with these extensions should be indexed (case-sensitive).
This is only relevant for file system indexing, when you index files via
http you need to set @HTTP_CONTENT_TYPES instead. [re-index] |
| @EXT = ("html", "htm", "shtml"); |
| |
| If you do not have telnet/ssh access to the server that runs the script, you
need to execute indexer.pl via CGI. Of course not everybody should be able
to do that, so set a password with this option.
** Note ** : Only use this if absolutely necessary! Setting to "" disables
execution as a CGI, which is much more secure. Note that other people on
your server can probably read this file and look up your password. |
| $INDEXER_CGI_PASSWORD = ""; |
| |
http configuration
You only need this if you want to index your pages via http |
| |
| Where you want the indexer to start via http. Leave empty if
you want to index the files in the filesystem ($DOCUMENT_ROOT).
** WARNING **: Do not use for foreign servers! It might use too many
resources on other people's servers. [re-index]
example: $HTTP_START_URL = 'http://localhost/'; |
| $HTTP_START_URL = ''; |
| |
| The indexer might not notice if it runs into an endless loop. To avoid
that, set this to the maximum number of pages that will be visited
(this can be bigger than the number of pages indexed). [re-index] |
| $HTTP_MAX_PAGES = 100; |
| |
| The web server's document root. Normally that's the same as $DOCUMENT_ROOT,
it differs if you're only using Perlfect Search on a subdirectory. [re-index] |
| $HTTP_SERVER_ROOT = $DOCUMENT_ROOT; |
| |
| Limit crawling to these URL patterns. This is an important setting so
the script doesn't run out of control.
** WARNING **: The default ($HTTP_START_URL) should not be changed,
otherwise you risk the script crawling remote servers. For example,
the robots.txt file will only be used on the $HTTP_START_URL server!
[re-index] |
| @HTTP_LIMIT_URLS = ($HTTP_START_URL); |
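As an illustration of how the http settings above fit together, here is a minimal sketch of a configuration that indexes via http instead of the file system. The URL and the page limit are hypothetical values chosen for the example, not recommendations from this file.

```perl
# Hypothetical values: a site served from the local web server.
$HTTP_START_URL   = 'http://localhost/';   # non-empty, so indexing runs via http
$HTTP_MAX_PAGES   = 500;                   # assumed upper bound on visited pages
$HTTP_SERVER_ROOT = $DOCUMENT_ROOT;        # same as the file system root here
@HTTP_LIMIT_URLS  = ($HTTP_START_URL);     # keep the default: stay on this server
```

Because these options are marked [re-index], indexer.pl has to be run again after changing them.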
| |
| Comment this out if you want to ignore robots.txt (only do that if
you really know what you are doing): |
| $ROBOT_AGENT = 'perlfectsearch'; |
| |
| Should the indexer follow links that are commented out? |
| $HTTP_FOLLOW_COMMENT_LINKS = 1; |
| |
| Only if indexing via http: the content types to index.
Add 'application/msword' for MS-Word,
'application/pdf' for PDF. [re-index] |
| @HTTP_CONTENT_TYPES = ('text/html', 'text/plain'); |
| |
| Set to 1 to get verbose output during indexing. [re-index] |
| $HTTP_DEBUG = 1; |
| |
advanced configuration
You only need this if you want to adapt advanced features |
| |
| Programs that convert other formats to ascii text.
The name of the file to be filtered is passed as FILENAME, and the command
must print out ascii (or latin1) text.
pdftotext is part of xpdf, available at
http://www.foolabs.com/xpdf/download.html
antiword is available at http://www.winfield.demon.nl/
NOTE: You also have to set @EXT or @HTTP_CONTENT_TYPES accordingly.
If there's a problem with pdftotext, try a new version or hand over
the -raw option to pdftotext.
[re-index] |
%EXT_FILTER = (
"pdf" => "/usr/bin/pdftotext FILENAME -",
"doc" => "/usr/bin/antiword FILENAME"
); |
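The note above says that @EXT or @HTTP_CONTENT_TYPES must be set to match the filters. A sketch of what that could look like for PDF and MS-Word documents follows; it assumes pdftotext and antiword are installed at the paths already shown in this file.

```perl
# File system indexing: add the extensions that have a filter.
@EXT = ("html", "htm", "shtml", "pdf", "doc");

# http indexing: add the matching content types instead.
@HTTP_CONTENT_TYPES = ('text/html', 'text/plain',
                       'application/pdf', 'application/msword');

%EXT_FILTER = (
   "pdf" => "/usr/bin/pdftotext FILENAME -",
   # If pdftotext gives trouble, the -raw variant mentioned above would be:
   # "pdf" => "/usr/bin/pdftotext -raw FILENAME -",
   "doc" => "/usr/bin/antiword FILENAME"
);
```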
| |
| How many results should be shown per page. |
| $RESULTS_PER_PAGE = 5; |
| |
| Limit the number of results. 0 = no limit. |
| $MAX_RESULTS = 0; |
| |
| Enable the "highlight matches" feature that displays the original
pages, but with the search terms highlighted. See the README on
restrictions of this feature. |
| $HIGHLIGHT_MATCHES = 1; |
| |
| A "highlight matches" link only works for HTML files, so only
offer such a link for files with these suffixes.
** Note **: If $HTTP_START_URL is not set, the highlighting
will load the file from disk, so the user might find
passwords in the highlighted file! So don't include the suffixes
of dynamic files unless you are using $HTTP_START_URL. |
| @HIGHLIGHT_EXT = ("html", "htm"); |
| |
| Perlfect Search can highlight the search terms in the matching
document. These are the colors that will be used for the background
of the terms (the browser must support CSS for this). If the last color
is used, the first one will be used again if there are still terms left. |
| @HIGHLIGHT_COLORS = ('#4fafea', '#e5b547', '#aaaaaa', '#ee77ee'); |
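The wrap-around described above (the first color is reused once the last one has been used) can be sketched with a modulo on the term position. This is only an illustration, not the code used by search.pl; the helper color_for_term and the list of terms are hypothetical.

```perl
use strict;
use warnings;

my @HIGHLIGHT_COLORS = ('#4fafea', '#e5b547', '#aaaaaa', '#ee77ee');
my @terms = ('perl', 'search', 'index', 'config', 'template');  # hypothetical query

# Hypothetical helper: pick the color for the term at a zero-based position,
# starting over at the first color when the list is exhausted.
sub color_for_term {
    my ($position) = @_;
    return $HIGHLIGHT_COLORS[ $position % scalar @HIGHLIGHT_COLORS ];
}

for my $i (0 .. $#terms) {
    printf "%s => %s\n", $terms[$i], color_for_term($i);
}
```

With four colors, the fifth term gets the first color again.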
| |
| Show the ranking in percent, with the first document = 100%. |
| $PERCENTAGE_RANKING = 1; |
| |
| Do you want to index numbers? If so set $INDEX_NUMBERS to 1. [re-index] |
| $INDEX_NUMBERS = 0; |
| |
| If you don't have enough memory, set this to 1. This will slow down
indexer.pl by a factor of about 2. Searching is not affected. |
| $LOW_MEMORY_INDEX = 1; |
| |
| How much of the document should be put in the index? With this option,
the context of the match is shown on the results page. This only works
if the match was in the first $CONTEXT_SIZE bytes of the document.
Warning: Using this option will generate a very big index file.
Set to 0 to disable, set to -1 for no limit. [re-index] |
| $CONTEXT_SIZE = 0; |
| |
| If $CONTEXT_SIZE is enabled, how many occurrences of every term should be shown
on the results page? |
| $CONTEXT_EXAMPLES = 2; |
| |
| If $CONTEXT_SIZE is enabled, how many words should be used to show the context
of a term? |
| $CONTEXT_DESC_WORDS = 12; |
| |
| How many words should be used from the <BODY> of an html document as a
description for the document in case there is no <META description> tag
available and $CONTEXT_SIZE is 0. [re-index] |
| $DESC_WORDS = 25; |
| |
| The minimum length of a word. Any word of smaller size is not indexed.
[re-index] |
| $MINLENGTH = 3; |
| |
| If you have umlauts or accents etc. in your text, enable this.
With this option accented characters will be indexed as the characters
they are based on, without this option they will
be filtered out completely (you don't want that). [re-index] |
| $SPECIAL_CHARACTERS = 1; |
| |
| The largest acceptable word size. Reducing this saves space but decreases
result accuracy. Setting the variable to 0 ignores stemming altogether.
[re-index] |
| $STEMCHARS = 0; |
| |
| Add URLs to the index, so one can search for them? Note that special
characters will be ignored, just as in normal text. [re-index] |
| $INDEX_URLS = 0; |
| |
| You can completely ignore certain parts of your documents if you put these
HTML comments around them. [re-index] |
$IGNORE_TEXT_START = '<!--ignore_perlfect_search-->';
$IGNORE_TEXT_END = '<!--/ignore_perlfect_search-->'; |
| |
| The maximum length of <title> elements, everything longer than this
will be cut off. [re-index] |
| $MAX_TITLE_LENGTH = 80; |
| |
| How much more important are words found in the title, in the meta values
(author, description, keywords), and in the headlines compared to normal
text in the body? This influences the ranking of the results.
Use any integer (0 = ignore that text completely) [re-index] |
$TITLE_WEIGHT = 5;
$META_WEIGHT = 5;
$H_WEIGHT{'1'} = 5; # headline <h1>...</h1>
$H_WEIGHT{'2'} = 4;
$H_WEIGHT{'3'} = 3;
$H_WEIGHT{'4'} = 1;
$H_WEIGHT{'5'} = 1;
$H_WEIGHT{'6'} = 1; # headline <h6>...</h6> |
| |
| If you want to log the queries to an extra file, set this to 1.
Every use of search.pl will then be logged to data/log.txt. That file
has to exist and must be writable for the webserver. The line format is:
REMOTE_HOST;date;terms;matches;current page;(time to search in seconds);
NOTE: You'll have to uncomment two lines at the top of search.pl to get the
time value (see the comment there).
NOTE: if you have many queries, this file will grow quite fast. |
| $LOG = 0; |
| |
| This will increase the score of results that contain more than one of
the searched terms. Queries with only one term will not be affected.
The number given here is a factor that multiplies the score (even
several times, if there are more than two terms). 0 turns it off. |
| $MULTIPLE_MATCH_BOOST = 0; |
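One possible reading of the description above, given as a rough sketch: the factor is applied once for each matched term beyond the first. This is an assumption about the arithmetic, not the formula taken from search.pl; the function boosted_score and the sample numbers are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the assumed arithmetic:
# score * boost ** (matched_terms - 1), with 0 meaning "off".
sub boosted_score {
    my ($score, $matched_terms, $boost) = @_;
    return $score if $boost == 0 || $matched_terms < 2;
    return $score * $boost ** ($matched_terms - 1);
}

print boosted_score(10, 1, 2), "\n";   # one term matched: unchanged
print boosted_score(10, 2, 2), "\n";   # two terms matched: boosted once
print boosted_score(10, 3, 2), "\n";   # three terms matched: boosted twice
```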
| |
| Date format for the result page. %Y = year, %m = month, %d = day,
%H = hour, %M = minute, %S = second. On a Unix system use
'man strftime' to get a list of all possible options. |
| $DATE_FORMAT = "%Y-%m-%d"; |
| |
| Date format for the "Latest Index update" information on the result page. |
| $INDEX_DATE_FORMAT = "%Y-%m-%d %H:%M"; |
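To see what the two format strings above produce, they can be tried with strftime from Perl's standard POSIX module. The sketch below uses a fixed timestamp (the Unix epoch in UTC) purely as sample input.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

my $DATE_FORMAT       = "%Y-%m-%d";
my $INDEX_DATE_FORMAT = "%Y-%m-%d %H:%M";

# Sample input: the Unix epoch, broken down in UTC.
my @time_parts = gmtime(0);

my $date       = strftime($DATE_FORMAT, @time_parts);
my $index_date = strftime($INDEX_DATE_FORMAT, @time_parts);

print "$date\n";        # 1970-01-01
print "$index_date\n";  # 1970-01-01 00:00
```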
| |
| Directory with templates (normally you don't have to modify this). |
| $TEMPLATE_DIR = $INSTALL_DIR.'templates/'; |
| |
| The default language. This is the language that's used if no lang
parameter is passed to the script or if the parameter is invalid. |
| $DEFAULT_LANG = 'en'; |
| |
| The result templates for several languages. |
$SEARCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'search.html';
$SEARCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'search_de.html';
$SEARCH_TEMPLATE{'fr'} = $TEMPLATE_DIR.'search_fr.html';
$SEARCH_TEMPLATE{'it'} = $TEMPLATE_DIR.'search_it.html';
$NO_MATCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'no_match.html';
$NO_MATCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'no_match_de.html';
$NO_MATCH_TEMPLATE{'fr'} = $TEMPLATE_DIR.'no_match_fr.html';
$NO_MATCH_TEMPLATE{'it'} = $TEMPLATE_DIR.'no_match_it.html';
# This is the template for using search.pl via command line:
$SEARCH_TEMPLATE{'text'} = $TEMPLATE_DIR.'search.txt';
$NO_MATCH_TEMPLATE{'text'} = $TEMPLATE_DIR.'no_match.txt';
# This is the template for using the test cases (development only):
$SEARCH_TEMPLATE{'qa'} = $INSTALL_DIR.'qa/search_qa.txt';
$NO_MATCH_TEMPLATE{'qa'} = $INSTALL_DIR.'qa/no_match_qa.txt'; |
| |
| The text for the "Next Page" link in several languages. |
$NEXT_PAGE{'en'} = 'Next';
$NEXT_PAGE{'de'} = 'naechste Seite';
$NEXT_PAGE{'fr'} = 'Suivant';
$NEXT_PAGE{'it'} = 'Successiva'; |
| |
| The text for the "Previous Page" link in several languages. |
$PREV_PAGE{'en'} = 'Previous';
$PREV_PAGE{'de'} = 'vorige Seite';
$PREV_PAGE{'fr'} = 'Precedent';
$PREV_PAGE{'it'} = 'Precedente'; |
| |
| Text of the link that shows a colored background for matched terms: |
$HIGHLIGHT_TERMS{'en'} = 'highlight matches';
$HIGHLIGHT_TERMS{'de'} = 'Treffer hervorheben'; |
| |
| The text for the "too common" warning. <WORDS> will be replaced with
a list of the ignored words. If there are no ignored words, this text
will not appear. |
$IGNORED_WORDS{'en'} = '<p>The following words are either too short or very common and were
not included in your search: <strong><WORDS></strong></p>';
|
| |
You shouldn't have to edit anything below this line. |
| |
| Various paths (do NOT use system-wide /tmp for security reasons!) |
$TMP_DIR = $INSTALL_DIR.'temp/';
$DATA_DIR = $INSTALL_DIR.'data/';
$CONF_DIR = $INSTALL_DIR."conf/";
$STOPWORDS_FILE = $CONF_DIR.'stopwords.txt';
$NO_INDEX_FILE = $CONF_DIR.'no_index.txt';
$LOGFILE = $DATA_DIR.'log.txt';
$SEARCH = 'search.pl';
$SEARCH_URL = $CGIBIN.$SEARCH;
$UPDATE_FILE = $DATA_DIR.'update'; |
| |
| Paths to the database files. |
$INV_INDEX_DB_FILE = $DATA_DIR.'inv_index';
$DOCS_DB_FILE = $DATA_DIR.'docs';
$URLS_DB_FILE = $DATA_DIR.'urls';
$SIZES_DB_FILE = $DATA_DIR.'sizes';
$TERMS_DB_FILE = $DATA_DIR.'terms';
$DF_DB_FILE = $DATA_DIR.'df';
$TF_DB_FILE = $DATA_DIR.'tf';
$CONTENT_DB_FILE = $DATA_DIR.'content';
$DESC_DB_FILE = $DATA_DIR.'desc';
$TITLES_DB_FILE = $DATA_DIR.'titles';
$DATES_DB_FILE = $DATA_DIR.'dates'; |
| |
| Paths to the temporary database files. |
$INV_INDEX_TMP_DB_FILE = $DATA_DIR.'inv_index_tmp';
$DOCS_TMP_DB_FILE = $DATA_DIR.'docs_tmp';
$URLS_TMP_DB_FILE = $DATA_DIR.'urls_tmp';
$SIZES_TMP_DB_FILE = $DATA_DIR.'sizes_tmp';
$TERMS_TMP_DB_FILE = $DATA_DIR.'terms_tmp';
$CONTENT_TMP_DB_FILE = $DATA_DIR.'content_tmp';
$DESC_TMP_DB_FILE = $DATA_DIR.'desc_tmp';
$TITLES_TMP_DB_FILE = $DATA_DIR.'titles_tmp';
$DATES_TMP_DB_FILE = $DATA_DIR.'dates_tmp'; |
| |