[Perlfect-search] PDF's

Dave Filchak perlfect-search@perlfect.com
Fri, 28 Mar 2003 15:18:11 -0500
This is a multi-part message in MIME format.

------=_NextPart_000_0042_01C2F53D.3F0C3510
Content-Type: text/plain;
        charset="us-ascii"
Content-Transfer-Encoding: 8bit

Hey Daniel,

Actually, they were indexed. Everything seemed to index properly, but I
still got all these errors.

conf.pl is attached, and I will check for an earlier error.


Dave

-----Original Message-----
From: perlfect-search-admin@perlfect.com
[mailto:perlfect-search-admin@perlfect.com]On Behalf Of Daniel Naber
Sent: March 28, 2003 2:58 PM
To: perlfect-search@perlfect.com
Subject: Re: [Perlfect-search] PDF's


On Friday 28 March 2003 19:35, Dave Filchak wrote:

> Yes this now works with the PDF's but I am getting the following error
> numerous times as the indexer goes thru the site. I never got these
> before.

I assume the PDF files are not indexed when you get these errors, right?
Can you send me your conf.pl? There should be an error before all these
"uninitialized value" warnings. Can you look for that?

Regards
 Daniel

--
http://www.danielnaber.de
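
A minimal sketch of one way to look for the error Daniel mentions,
assuming the indexer output has been captured as a list of lines. The
function name lines_before_first_warning and the sample lines are
hypothetical; they are not part of Perlfect Search and do not reproduce
its real output.

```perl
use strict;
use warnings;

# Return the non-blank lines that come before the first
# "uninitialized value" warning. The original error, if there is one,
# should be among them.
sub lines_before_first_warning {
    my @lines = @_;
    my @before;
    for my $line (@lines) {
        last if $line =~ /uninitialized value/i;
        push @before, $line if $line =~ /\S/;
    }
    return @before;
}

# Hypothetical sample output; real indexer messages will differ.
my @sample = (
    "PLACEHOLDER: progress message\n",
    "\n",
    "PLACEHOLDER: first error message\n",
    "Use of uninitialized value (placeholder warning)\n",
    "Use of uninitialized value (placeholder warning)\n",
);

print lines_before_first_warning(@sample);
```

To use this on a real run, the sample list could be replaced with
<STDIN> and the indexer output (including STDERR) piped into the script.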


------=_NextPart_000_0042_01C2F53D.3F0C3510
Content-Type: application/octet-stream;
        name="conf.pl"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
        filename="conf.pl"

# Perlfect Search configuration file=0A=
#$rcs =3D ' $Id: conf.pl,v 1.64 2003/02/24 21:10:16 daniel Exp $ ' ;=0A=
=0A=
# NOTE: Whenever you change one of the options that's marked with =
[re-index]=0A=
# you need to run indexer.pl again to make the change take effect.=0A=
=0A=
#########################################################################=
##=0A=
### basic configuration=0A=
### You'll have to adapt these values if you didn't use setup.pl=0A=
=0A=
# Where do you want the indexer to start on your disk?=0A=
# ** Note ** : If your files are generated dynamically (e.g. via PHP)=0A=
# you should set $HTTP_START_URL (see below), otherwise users=0A=
# will be able to see your pages' source code using the=0A=
# "highlight matches" link.=0A=
# [re-index]=0A=
$DOCUMENT_ROOT =3D 'C:\Inetpub\IWH\/';=0A=
=0A=
# The base url of your site (normally that's the URL which=0A=
# corresponds to $DOCUMENT_ROOT).=0A=
$BASE_URL =3D 'http://iwh.zuka.net';=0A=
=0A=
# The url in which Perlfect Search is located (usually somewhere in =
cgi-bin/).=0A=
$CGIBIN =3D 'http://iwh.zuka.net/cgi-bin/perlfect/search/';=0A=
=0A=
# The full-path of the directory where Perlfect Search is installed.=0A=
$INSTALL_DIR =3D 'C:\Inetpub\IWH\cgi-bin\/perlfect/search/';=0A=
=0A=
# Only files with these extensions should be indexed (case-sensitive). =0A=
# This is only relevant for file system indexing, when you index files =
via=0A=
# http you need to set @HTTP_CONTENT_TYPES instead. [re-index]=0A=
@EXT =3D ("htm","html","shtml","txt","pdf","doc","php");=0A=
=0A=
# If you do not have telnet/ssh access to the server that runs the =
script, you=0A=
# need to execute indexer.pl via CGI. Of course not everybody should be =
able=0A=
# to do that, so set a password with this option.=0A=
# ** Note ** : Only use this if absolutely necessary! Setting to "" =
disables =0A=
# execution as a CGI, which is much more secure. Note that other people =
on=0A=
# your server can probably read this file and look up your password.=0A=
$INDEXER_CGI_PASSWORD =3D "admin";=0A=
=0A=
#########################################################################=
##=0A=
### http configuration=0A=
### You only need this if you want to index your pages via http=0A=
=0A=
# Where you want the indexer to start via http. Leave empty if=0A=
# you want to index the files in the filesystem ($DOCUMENT_ROOT).=0A=
# ** WARNING **: Do not use for foreign servers! It might use too many=0A=
# resources on other people's servers. [re-index]=0A=
# example: $HTTP_START_URL =3D 'http://localhost/';=0A=
$HTTP_START_URL =3D '';=0A=
=0A=
# The indexer might not notice if it runs into an endless loop. To avoid=0A=
# that, set this to the maximum number of pages that will be visited=0A=
# (this can be bigger than the number of pages indexed). [re-index]=0A=
$HTTP_MAX_PAGES =3D 100;=0A=
=0A=
# The web server's document root. Normally that's the same as =
$DOCUMENT_ROOT,=0A=
# it differs if you're only using Perlfect Search on a subdirectory. =
[re-index]=0A=
$HTTP_SERVER_ROOT =3D $DOCUMENT_ROOT;=0A=
=0A=
# Limit crawling to these URL patterns. This is an important setting so=0A=
# the script doesn't run out of control.=0A=
# ** WARNING **: The default ($HTTP_START_URL) should not be changed,=0A=
# otherwise you risk the script crawling remote servers. For example,=0A=
# the robots.txt file will only be used on the $HTTP_START_URL server!=0A=
# [re-index]=0A=
@HTTP_LIMIT_URLS =3D ($HTTP_START_URL);=0A=
=0A=
# Comment this out if you want to ignore robots.txt (only do that if=0A=
# you really know what you are doing):=0A=
$ROBOT_AGENT =3D 'perlfectsearch';=0A=
=0A=
# Should the indexer follow links that are commented out?=0A=
$HTTP_FOLLOW_COMMENT_LINKS =3D 1;=0A=
=0A=
# Only if indexing via http: the content types to index. =0A=
# Add 'application/msword' for MS-Word, =0A=
# 'application/pdf' for PDF. [re-index]=0A=
@HTTP_CONTENT_TYPES =3D ('text/html', 'text/plain', 'application/pdf');=0A=
=0A=
# Set to 1 to get verbose output during indexing. [re-index]=0A=
$HTTP_DEBUG =3D 1;=0A=
=0A=
#########################################################################=
##=0A=
### advanced configuration=0A=
### You only need this if you want to adapt advanced features=0A=
=0A=
# Programs that convert other formats to ascii text.=0A=
# The name of the file to be filtered is passed as FILENAME, and the =
command=0A=
# must print out ascii (or latin1) text.=0A=
# pdftotext is part of xpdf, available at=0A=
# http://www.foolabs.com/xpdf/download.html=0A=
# antiword is available at http://www.winfield.demon.nl/=0A=
# NOTE: You also have to set @EXT or @HTTP_CONTENT_TYPES accordingly.=0A=
# If there's a problem with pdftotext, try a newer version or pass=0A=
# the -raw option to pdftotext.=0A=
# [re-index]=0A=
%EXT_FILTER =3D (=0A=
           "pdf" =3D> "c:/xpdf/pdftotext -raw FILENAME -",=0A=
           #"doc" =3D> "/usr/bin/antiword FILENAME"=0A=
);=0A=
=0A=
# How many results should be shown per page.=0A=
$RESULTS_PER_PAGE =3D 10;=0A=
=0A=
# Limit the number of results. 0 =3D no limit.=0A=
$MAX_RESULTS =3D 0;=0A=
=0A=
# Enable the "highlight matches" feature that displays the original=0A=
# pages, but with the search terms highlighted. See the README on=0A=
# restrictions of this feature.=0A=
$HIGHLIGHT_MATCHES =3D 1;=0A=
=0A=
# A "highlight matches" link only works for HTML files, so only=0A=
# offer such a link for files with these suffixes.=0A=
# ** Note **: If $HTTP_START_URL is not set, the highlighting=0A=
# will load the file from disk so that the user might find=0A=
# passwords in the highlighted file! So don't include=0A=
# dynamic files unless you are using $HTTP_START_URL.=0A=
@HIGHLIGHT_EXT =3D ("html", "htm");=0A=
=0A=
# Perlfect Search can highlight the search terms in the matching=0A=
# document. These are the colors that will be used for the background=0A=
# of the terms (the browser must support CSS for this). If the last =
color =0A=
# is used, the first one will be used again if there are still terms =
left.=0A=
@HIGHLIGHT_COLORS =3D ('#4fafea', '#e5b547', '#aaaaaa', '#ee77ee');=0A=
=0A=
# Show the ranking in percent, with the first document =3D 100%.=0A=
$PERCENTAGE_RANKING =3D 1;=0A=
=0A=
# Do you want to index numbers? If so set $INDEX_NUMBERS to 1. [re-index]=0A=
$INDEX_NUMBERS =3D 0;=0A=
=0A=
# If you don't have enough memory, set this to 1. This will slow down =0A=
# indexer.pl by a factor of about 2. Searching is not affected.=0A=
$LOW_MEMORY_INDEX =3D 1;=0A=
=0A=
# How much of the document should be put in the index? With this option,=0A=
# the context of the match is shown on the results page. This only works=0A=
# if the match was in the first $CONTEXT_SIZE bytes of the document.=0A=
# Warning: Using this option will generate a very big index file.=0A=
# Set to 0 to disable, set to -1 for no limit. [re-index]=0A=
$CONTEXT_SIZE =3D 0;=0A=
=0A=
# If $CONTEXT_SIZE is enabled, how many occurrences of every term should =
be shown=0A=
# on the results page?=0A=
$CONTEXT_EXAMPLES =3D 2;=0A=
=0A=
# If $CONTEXT_SIZE is enabled, how many words should be used to show the =
context=0A=
# of a term?=0A=
$CONTEXT_DESC_WORDS =3D 12;=0A=
=0A=
# How many words should be used from the <BODY> of an html document as a =0A=
# description for the document in case there is no <META description> =
tag =0A=
# available and $CONTEXT_SIZE is 0. [re-index]=0A=
$DESC_WORDS =3D 25;=0A=
=0A=
# The minimum length of a word. Any word of smaller size is not indexed. =0A=
# [re-index]=0A=
$MINLENGTH =3D 3;=0A=
=0A=
# If you have umlauts or accents etc. in your text, enable this.=0A=
# With this option accented characters will be indexed as the characters=0A=
# they are based on (e.g. =E8 -> e, =FC -> u), without this option they =
will=0A=
# be filtered out completely (you don't want that). [re-index]=0A=
$SPECIAL_CHARACTERS =3D 1;=0A=
=0A=
# The largest acceptable word size. Reducing this saves space but =
decreases=0A=
# result accuracy. Setting the variable to 0 disables stemming =
altogether.=0A=
# [re-index]=0A=
$STEMCHARS =3D 0;=0A=
=0A=
# Add URLs to the index, so one can search for them? Note that special=0A=
# characters will be ignored, just as in normal text. [re-index]=0A=
$INDEX_URLS =3D 0;=0A=
=0A=
# You can completely ignore certain parts of your documents if you put =
these =0A=
# HTML comments around them. [re-index]=0A=
$IGNORE_TEXT_START =3D '<!--ignore_perlfect_search-->';=0A=
$IGNORE_TEXT_END =3D '<!--/ignore_perlfect_search-->';=0A=
=0A=
# The maximum length of <title> elements, everything longer than this=0A=
# will be cut off. [re-index]=0A=
$MAX_TITLE_LENGTH =3D 80;=0A=
=0A=
# How much more important are words found in the title, in the meta =
values=0A=
# (author, description, keywords), and in the headlines compared to =
normal =0A=
# text in the body? This influences the ranking of the results.=0A=
# Use any integer (0 =3D ignore that text completely) [re-index]=0A=
$TITLE_WEIGHT =3D 5;=0A=
$META_WEIGHT =3D 5;=0A=
$H_WEIGHT{'1'} =3D 5;   # headline <h1>...</h1>=0A=
$H_WEIGHT{'2'} =3D 4;=0A=
$H_WEIGHT{'3'} =3D 3;=0A=
$H_WEIGHT{'4'} =3D 1;=0A=
$H_WEIGHT{'5'} =3D 1;=0A=
$H_WEIGHT{'6'} =3D 1;   # headline <h6>...</h6>=0A=
=0A=
# If you want to log the queries to an extra file, set this to 1.=0A=
# Every use of search.pl will then be logged to data/log.txt. That file=0A=
# has to exist and must be writable for the webserver. The line format =
is:=0A=
# REMOTE_HOST;date;terms;matches;current page;(time to search in =
seconds);=0A=
# NOTE: You'll have to uncomment two lines at the top of search.pl to =
get the =0A=
# time value (see the comment there).=0A=
# NOTE: if you have many queries, this file will grow quite fast.=0A=
$LOG =3D 0;=0A=
=0A=
# This will increase the score of results that contain more than one of=0A=
# the searched terms. Queries with only one term will not be affected.=0A=
# The number given here is a factor that multiplies the score (even=0A=
# several times, if there are more than two terms). 0 turns it off.=0A=
$MULTIPLE_MATCH_BOOST =3D 0;=0A=
=0A=
# Date format for the result page. %Y =3D year, %m =3D month, %d =3D day,=0A=
# %H =3D hour, %M =3D minute, %S =3D second. On a Unix system use =0A=
# 'man strftime' to get a list of all possible options.=0A=
$DATE_FORMAT =3D "%Y-%m-%d";=0A=
=0A=
# Date format for the "Latest Index update" information on the result =
page.=0A=
$INDEX_DATE_FORMAT =3D "%Y-%m-%d %H:%M";=0A=
=0A=
# Directory with templates (normally you don't have to modify this).=0A=
$TEMPLATE_DIR =3D $INSTALL_DIR.'templates/';=0A=
=0A=
# What's the default language. This is the language that's used if no =
lang=0A=
# parameter is passed to the script or if the parameter is invalid.=0A=
$DEFAULT_LANG =3D 'en';=0A=
=0A=
# The result templates for several languages.=0A=
$HEADER =3D $TEMPLATE_DIR.'header_01.inc.htm';=0A=
$FOOTER =3D $TEMPLATE_DIR.'footer_01.inc.htm';=0A=
$SEARCH_TEMPLATE{'en'} =3D $TEMPLATE_DIR.'search.asp';=0A=
$SEARCH_TEMPLATE{'de'} =3D $TEMPLATE_DIR.'search_de.html';=0A=
$SEARCH_TEMPLATE{'fr'} =3D $TEMPLATE_DIR.'search_fr.html';=0A=
$SEARCH_TEMPLATE{'it'} =3D $TEMPLATE_DIR.'search_it.html';=0A=
$NO_MATCH_TEMPLATE{'en'} =3D $TEMPLATE_DIR.'no_match.html';=0A=
$NO_MATCH_TEMPLATE{'de'} =3D $TEMPLATE_DIR.'no_match_de.html';=0A=
$NO_MATCH_TEMPLATE{'fr'} =3D $TEMPLATE_DIR.'no_match_fr.html';=0A=
$NO_MATCH_TEMPLATE{'it'} =3D $TEMPLATE_DIR.'no_match_it.html';=0A=
# This is the template for using search.pl via command line:=0A=
$SEARCH_TEMPLATE{'text'} =3D $TEMPLATE_DIR.'search.txt';=0A=
$NO_MATCH_TEMPLATE{'text'} =3D $TEMPLATE_DIR.'no_match.txt';=0A=
# This is the template for using the test cases (development only):=0A=
$SEARCH_TEMPLATE{'qa'} =3D $INSTALL_DIR.'qa/search_qa.txt';=0A=
$NO_MATCH_TEMPLATE{'qa'} =3D $INSTALL_DIR.'qa/no_match_qa.txt';=0A=
=0A=
# The text for the "Next Page" link in several languages.=0A=
$NEXT_PAGE{'en'} =3D 'Next';=0A=
$NEXT_PAGE{'de'} =3D 'n&auml;chste Seite';=0A=
$NEXT_PAGE{'fr'} =3D 'Suivant';=0A=
$NEXT_PAGE{'it'} =3D 'Successiva';=0A=
=0A=
# The text for the "Previous Page" link in several languages.=0A=
$PREV_PAGE{'en'} =3D 'Previous';=0A=
$PREV_PAGE{'de'} =3D 'vorige Seite';=0A=
$PREV_PAGE{'fr'} =3D 'Pr=E9c=E9dent';=0A=
$PREV_PAGE{'it'} =3D 'Precedente';=0A=
=0A=
# Text of the link that shows a colored background for matched terms:=0A=
$HIGHLIGHT_TERMS{'en'} =3D 'highlight matches';=0A=
$HIGHLIGHT_TERMS{'de'} =3D 'Treffer hervorheben';=0A=
=0A=
# The text for the "too common" warning. <WORDS> will be replaced with=0A=
# a list of the ignored words. If there are no ignored words, this text=0A=
# will not appear.=0A=
$IGNORED_WORDS{'en'} =3D '<p>The following words are either too short or =
very common and were=0A=
        not included in your search: <strong><WORDS></strong></p>';=0A=
$IGNORED_WORDS{'de'} =3D '<p>Folgende W=F6rter sind zu kurz oder kommen =
sehr h=E4ufig vor und wurden =0A=
        daher in Ihrer Suchanfrage ignoriert: <strong><WORDS></strong></p>';=0A=
# fixme: "too short" missing:=0A=
$IGNORED_WORDS{'fr'} =3D '<p>Les mots suivants sont tr=E8s courants et =
n\'ont =0A=
        pas =E9t=E9 inclus dans votre recherche: <strong><WORDS></strong></p>';=0A=
# fixme: "too short" missing:=0A=
$IGNORED_WORDS{'it'} =3D '<p>Le seguenti parole sono molto comuni e non=0A=
    saranno incluse nella vostra ricerca: <strong><WORDS></strong></p>';=0A=
=0A=
#########################################################################=
##=0A=
### You shouldn't have to edit anything below this line.=0A=
=0A=
# Various paths (do NOT use system-wide /tmp for security reasons!)=0A=
$TMP_DIR  =3D $INSTALL_DIR.'temp/';=0A=
$DATA_DIR =3D $INSTALL_DIR.'data/';=0A=
$CONF_DIR =3D $INSTALL_DIR."conf/";=0A=
$STOPWORDS_FILE =3D $CONF_DIR.'stopwords.txt';=0A=
$NO_INDEX_FILE =3D $CONF_DIR.'no_index.txt';=0A=
$LOGFILE =3D $DATA_DIR.'log.txt';=0A=
$SEARCH =3D 'search.pl';=0A=
$SEARCH_URL =3D $CGIBIN.$SEARCH;=0A=
$UPDATE_FILE =3D $DATA_DIR.'update';=0A=
=0A=
# Paths to the database files.=0A=
$INV_INDEX_DB_FILE =3D $DATA_DIR.'inv_index';=0A=
$DOCS_DB_FILE      =3D $DATA_DIR.'docs';=0A=
$URLS_DB_FILE      =3D $DATA_DIR.'urls';=0A=
$SIZES_DB_FILE     =3D $DATA_DIR.'sizes';=0A=
$TERMS_DB_FILE     =3D $DATA_DIR.'terms';=0A=
$DF_DB_FILE        =3D $DATA_DIR.'df';=0A=
$TF_DB_FILE        =3D $DATA_DIR.'tf';=0A=
$CONTENT_DB_FILE   =3D $DATA_DIR.'content';=0A=
$DESC_DB_FILE      =3D $DATA_DIR.'desc';=0A=
$TITLES_DB_FILE    =3D $DATA_DIR.'titles';=0A=
$DATES_DB_FILE     =3D $DATA_DIR.'dates';=0A=
=0A=
# Paths to the temporary database files.=0A=
$INV_INDEX_TMP_DB_FILE =3D $DATA_DIR.'inv_index_tmp';=0A=
$DOCS_TMP_DB_FILE      =3D $DATA_DIR.'docs_tmp';=0A=
$URLS_TMP_DB_FILE      =3D $DATA_DIR.'urls_tmp';=0A=
$SIZES_TMP_DB_FILE     =3D $DATA_DIR.'sizes_tmp';=0A=
$TERMS_TMP_DB_FILE     =3D $DATA_DIR.'terms_tmp';=0A=
$CONTENT_TMP_DB_FILE   =3D $DATA_DIR.'content_tmp';=0A=
$DESC_TMP_DB_FILE      =3D $DATA_DIR.'desc_tmp';=0A=
$TITLES_TMP_DB_FILE    =3D $DATA_DIR.'titles_tmp';=0A=
$DATES_TMP_DB_FILE     =3D $DATA_DIR.'dates_tmp';=0A=
=0A=
# Official version number.=0A=
$VERSION =3D "3.31b";=0A=
1;=0A=

------=_NextPart_000_0042_01C2F53D.3F0C3510--
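
The conf.pl attachment above is shown in its quoted-printable transfer
encoding, so "=3D" stands for "=" and "=0A=" marks a line end. Below is
a minimal sketch of decoding it with the core MIME::QuotedPrint module.
The two sample lines are copied from the attachment; the variable names
are illustrative only.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Two lines copied from the attachment exactly as they appear above.
my $encoded = <<'END_QP';
# Limit the number of results. 0 =3D no limit.=0A=
$MAX_RESULTS =3D 0;=0A=
END_QP

# decode_qp turns =XX escapes back into bytes and removes the soft
# line breaks (a trailing "=" at the end of a line).
my $decoded = decode_qp($encoded);

print $decoded;
# prints:
# # Limit the number of results. 0 = no limit.
# $MAX_RESULTS = 0;
```

The same call can be applied to the whole attachment body after reading
it from a saved file, which should yield a conf.pl that Perl can load.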