From yh at senate.be Wed Dec 1 09:13:31 2004 From: yh at senate.be (Yves Hanotiau) Date: Wed Dec 1 13:37:26 2004 Subject: [Perlfect-search] multiple templates for the search result page Message-ID: <3A990DB4C59FD411899B00D0B76961B601D12C5C@mailoffice.senate.be> Hello, I'm using Perlfect Search 3.31. Is it possible to have different templates for the "search result" page ? Thanks in advance Yves HANOTIAU From david.wessel at gwi-ag.com Wed Dec 1 14:55:02 2004 From: david.wessel at gwi-ag.com (David Wessel) Date: Wed Dec 1 14:55:12 2004 Subject: [Perlfect-search] Search result misbehavior via http / fine-working under console Message-ID: <53811A2271D0D41194120002B31FD38B051D8771@exch01trier.trier.gwi> Hello, I have successfully installed perlfect under win-xp pro (with xampp and xampp-perl-addon) and indexed some files. Running "perl search.pl [word]" under the command prompt returns apropriate search results. There is however a problem with the display of search results using the web-interface. It would return search results for the _first_ query but not more. Entering another search-word will not change the results. When checking the error.log it says: [Wed Dec 1 15:04:57 2004] search.pl: Variable "$query" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 110. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@force" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 115. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@not" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 116. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@other" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 117. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@docs" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 118. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@valid_docs" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 119. 
[Wed Dec 1 15:04:57 2004] search.pl: Variable "%answer" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 120. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$query" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 141. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%urls_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 144. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@stopwords" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 237. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@stopwords_ignored" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 248. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$query" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 255. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%terms_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 279. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@force" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 280. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@not" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 285. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@other" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 287. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@not" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 305. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%inv_index_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 306. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@force" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 312. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@valid_docs" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 315. 
[Wed Dec 1 15:04:57 2004] search.pl: Variable "%docs_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 323. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@force" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 330. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@other" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 330. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@valid_docs" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 333. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%inv_index_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 336. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$query" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 339. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%docs_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 341. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%answer" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 346. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%dates_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 349. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$query" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 367. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%answer" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 380. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%docs_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 400. [Wed Dec 1 15:04:57 2004] search.pl: Variable "@stopwords_ignored" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 415. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%titles_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 449. 
[Wed Dec 1 15:04:57 2004] search.pl: Variable "%dates_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 486. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%sizes_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 505. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$query" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 587. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%answer" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 588. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%content_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 637. [Wed Dec 1 15:04:57 2004] search.pl: Variable "%desc_db" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 640. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$punct" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 685. [Wed Dec 1 15:04:57 2004] search.pl: Variable "$punct" will not stay shared at C:/Programme/xampp/htdocs/search/search.pl line 726. [Wed Dec 1 15:04:57 2004] tools.pl: Argument "O_RDONLY" isn't numeric in subroutine entry at C:/Programme/xampp/perl/site/lib/DB_File.pm line 278. the last line is repeated a couple of times whenever I trigger a new search. Any ideas? Sorry, it's not a public site for you to see it. 
greetings,
David

--------------------------------------------
David Wessel
Auszubildender Fachinformatik - Anwendungsentwicklung
GWI Research GmbH
Fachgruppe Kommunikation / Schnittstellen
Monaiser Straße 11, 54294 Trier
Tel.: 0651 / 8247 - 0
Fax.: 0651 / 8247 - 100
GWI SST - Hotline 01805 / 494483
http://www.gwi-ag.com
david.wessel@gwi-ag.com
--------------------------------------------

From daniel.naber at t-online.de Wed Dec 1 18:58:07 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Wed Dec 1 18:57:15 2004
Subject: [Perlfect-search] multiple templates for the search result page
In-Reply-To: <3A990DB4C59FD411899B00D0B76961B601D12C5C@mailoffice.senate.be>
References: <3A990DB4C59FD411899B00D0B76961B601D12C5C@mailoffice.senate.be>
Message-ID: <200412011958.08063@danielnaber.de>

On Wednesday 01 December 2004 10:13, Yves Hanotiau wrote:
> I'm using Perlfect Search 3.31.
> Is it possible to have different templates for the "search result" page?

You can use the language feature to define several templates:

$SEARCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'search.html';
$SEARCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'search_de.html';

The 'lang' parameter will then be used to select a template.

Regards
 Daniel

--
http://www.danielnaber.de

From daniel.naber at t-online.de Wed Dec 1 18:59:42 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Wed Dec 1 18:58:48 2004
Subject: [Perlfect-search] Search result misbehavior via http / fine-working under console
In-Reply-To: <53811A2271D0D41194120002B31FD38B051D8771@exch01trier.trier.gwi>
References: <53811A2271D0D41194120002B31FD38B051D8771@exch01trier.trier.gwi>
Message-ID: <200412011959.42372@danielnaber.de>

On Wednesday 01 December 2004 15:55, David Wessel wrote:
> [Wed Dec  1 15:04:57 2004] search.pl: Variable "$query" will not stay
> shared at C:/Programme/xampp/htdocs/search/search.pl line 110.

You probably have some CGI speed up installed (mod_perl?) that you need to
deactivate in the web server.
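For anyone who hits the same warning, here is a minimal sketch of why it matches the "only the first query works" symptom. It assumes the cause suggested above: a persistent environment such as mod_perl's registry handlers compiles the CGI script into the body of a subroutine, so the script's file-level `my` variables become locals of that subroutine. The names `run_request`, `stale_query` and `$fresh_query` are made up for this illustration and do not come from search.pl.

```perl
use strict;
use warnings;

# Stand-in for the wrapper sub that a persistent environment puts around
# a CGI script (hypothetical name, not part of Perlfect Search).
sub run_request {
    my ($query) = @_;

    # Named inner sub: it stays bound to the $query of the FIRST call.
    # perl warns at compile time: Variable "$query" will not stay shared
    sub stale_query { return $query }

    # Anonymous sub: a fresh closure over the current $query on every call.
    my $fresh_query = sub { return $query };

    return (stale_query(), $fresh_query->());
}

for my $word (qw(apple banana)) {
    my ($stale, $fresh) = run_request($word);
    print "query=$word: named sub sees '$stale', anonymous sub sees '$fresh'\n";
}
```

On the second call the named sub still reports 'apple' while the anonymous sub reports 'banana'. Running the script as plain CGI avoids the problem, because every request starts a new perl process.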
Regards
 Daniel

--
http://www.danielnaber.de

From david.wessel at gwi-ag.com Thu Dec 2 14:22:32 2004
From: david.wessel at gwi-ag.com (David Wessel)
Date: Thu Dec 2 14:23:01 2004
Subject: [Perlfect-search] indexing a site // index-size
Message-ID: <53811A2271D0D41194120002B31FD38B051D8776@exch01trier.trier.gwi>

Hello,

I have indexed a filesystem containing ~8000 files of different sizes, most
of them under 80k. I was wondering which $CONTEXT_SIZE makes sense. Right
now I use 0; it works fine and the search is fast (index under 25 MB).
Yet, I need more detailed search results. Has anyone experimented with this
value, or can anyone give recommendations?

thanks,
David

From roger_growe at earthlink.net Mon Dec 6 21:18:50 2004
From: roger_growe at earthlink.net (Roger Growe)
Date: Mon Dec 6 21:18:55 2004
Subject: [Perlfect-search] Problems with indexing
Message-ID: <001901c4dbd9$2eb8b6a0$0f02a8c0@URSA>

I'd really appreciate a hand with this script. After working through the
instructions, all the correspondence in the mailing list archives, and other
sources, I am still getting this when starting the indexer:

Using DB_File...
Checking for old temp files...
Building string of special characters...
Loading 'no index' regular expressions:
 - frontpage2.html
 - frontpage.html
[etc.]
Loading stopwords...371 stopwords loaded.
Starting crawler...
Note: I will not visit more than $HTTP_MAX_PAGES=150 pages.
Loading http://www.quinacrine.com/robots.txt...
Error: Couldn't get 'http://www.quinacrine.com/robots.txt': response code 500
Not using any robots.txt.
Error: Couldn't get 'http://www.quinacrine.com/index.html': response code 500 Crawler finished: indexed 0 files, 0 terms (0 different terms). Ignored 0 files because of conf/no_index.txt Ignored 0 files because of robots.txt I thought it might be the structure of the site that was the problem. The pages are not in the root but in the 'web' directory, like so: root .config .sessions cgi-bin logs web In cgi-bin I have these: searchsite conf data Perlfect temp templates I installed manually by necessity and need to index though http for the same reason. All syntax, permissions, and other rules that I can find check out. Unix server. The main sections of config.pl look like this now: $DOCUMENT_ROOT = 'http://www.quinacrine.com/'; # The base url of your site (normally that's the URL which # corresponds to $DOCUMENT_ROOT). $BASE_URL = 'http://www.quinacrine.com'; # The url in which Perlfect Search is located (usually somewhere in cgi-bin/). $CGIBIN = "/cgi-bin/searchsite/"; # The full-path of the directory where Perlfect Search is installed. $INSTALL_DIR = '/nfs/cust/5/80/46/564085/cgi-bin/searchsite/'; # Only files with these extensions should be indexed (case-sensitive). # This is only relevant for file system indexing, when you index files via # http you need to set @HTTP_CONTENT_TYPES instead. [re-index] @EXT = ("html", "htm", "shtml", "txt"); [Password section] ########################################################################### ### http configuration ### You only need this if you want to index your pages via http # Where you want the indexer to start via http. Leave empty if # you want to index the files in the filesystem ($DOCUMENT_ROOT). # ** WARNING **: Do not use for foreign servers! It might use too many # resources on other people's servers. 
[re-index]
# example: $HTTP_START_URL = 'http://localhost/';
$HTTP_START_URL = 'http://www.quinacrine.com/index.html';

Thinking that the file structure could be the issue, I put a copy of
robots.txt in the root, but I still get the 500 response. I've left
$HTTP_START_URL blank, used 'http://www.quinacrine.com/', and tried
everything else I could think of to break this jam.

Thanks for your help in advance,
Roger Growe

From daniel.naber at t-online.de Mon Dec 6 21:59:50 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Mon Dec 6 21:58:47 2004
Subject: [Perlfect-search] Problems with indexing
In-Reply-To: <001901c4dbd9$2eb8b6a0$0f02a8c0@URSA>
References: <001901c4dbd9$2eb8b6a0$0f02a8c0@URSA>
Message-ID: <200412062259.50749@danielnaber.de>

On Monday 06 December 2004 22:18, Roger Growe wrote:
> $HTTP_START_URL = 'http://www.quinacrine.com/index.html';

Only URLs below this will be indexed, so either remove the "index.html" or
set
@HTTP_LIMIT_URLS = ("http://www.quinacrine.com/");

Regards
 Daniel

--
http://www.danielnaber.de

From mandiv at corp.untd.com Tue Dec 7 09:58:37 2004
From: mandiv at corp.untd.com (Maninder, Singh)
Date: Tue Dec 7 09:58:49 2004
Subject: [Perlfect-search] RE: perlfect-search digest, Vol 1 #587 - 1 msg
Message-ID: <4D8B620F4FDA414982B23A3E7F6A8EF1037D140A@hydmail01.hyd.corp.int.untd.com>

I keep getting the following error:

>>Can't locate object method "TIEHASH" via package "" at
/blah/cgi-bin/perlfect/search/search_prod.pl line 76

Does anyone know the reason for this?
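A TIEHASH error with an empty package name suggests that no DBM module was loaded before the script tried to tie its index hashes. A quick way to narrow this down is to check which modules the perl on the server can actually load. The sketch below is a generic check, not part of Perlfect Search; `module_available` is a hypothetical helper, and DB_File is used as the default only because the indexer output quoted in this thread prints "Using DB_File...".

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded by this perl, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Check the modules named on the command line, or DB_File by default.
my @modules = @ARGV ? @ARGV : ('DB_File');
for my $module (@modules) {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'available' : 'NOT available';
}
```

Run it with the same perl binary that the web server uses for CGI scripts; a different perl installation on the same machine may have a different set of modules.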
From daniel.naber at t-online.de Tue Dec 7 18:46:46 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Tue Dec 7 18:45:42 2004
Subject: [Perlfect-search] RE: perlfect-search digest, Vol 1 #587 - 1 msg
In-Reply-To: <4D8B620F4FDA414982B23A3E7F6A8EF1037D140A@hydmail01.hyd.corp.int.untd.com>
References: <4D8B620F4FDA414982B23A3E7F6A8EF1037D140A@hydmail01.hyd.corp.int.untd.com>
Message-ID: <200412071946.47274@danielnaber.de>

On Tuesday 07 December 2004 10:58, Maninder, Singh wrote:
> >>Can't locate object method "TIEHASH" via package "" at
> >> /blah/cgi-bin/perlfect/search/search_prod.pl line 76
>
> Does anyone know the reason for this?

Looks like DB_File isn't installed.

Regards
 Daniel

--
http://www.danielnaber.de

From roger_growe at earthlink.net Tue Dec 7 19:22:31 2004
From: roger_growe at earthlink.net (Roger Growe)
Date: Tue Dec 7 19:22:39 2004
Subject: [Perlfect-search] Problems with indexing
Message-ID: <004b01c4dc92$1b2a9d10$0f02a8c0@URSA>

Thanks for your quick reply. I made the changes below and got the exact
same response.

$DOCUMENT_ROOT = 'http://www.quinacrine.com/';
$BASE_URL = 'http://www.quinacrine.com';
$CGIBIN = "/cgi-bin/searchsite/";
$INSTALL_DIR = '/nfs/cust/5/80/46/564085/cgi-bin/searchsite/';
@EXT = ("html", "htm", "shtml", "txt");
$INDEXER_CGI_PASSWORD =
$HTTP_START_URL = 'http://www.quinacrine.com/';
$HTTP_MAX_PAGES = 150;
$HTTP_SERVER_ROOT = $DOCUMENT_ROOT;
@HTTP_LIMIT_URLS = 'http://www.quinacrine.com/';

Thanks again, any suggestions?
Roger Growe

----- Original Message -----
From: "Daniel Naber"
To: "Roger Growe"
Cc:
Sent: Monday, December 06, 2004 4:59 PM
Subject: Re: [Perlfect-search] Problems with indexing

> On Monday 06 December 2004 22:18, Roger Growe wrote:
>
> > $HTTP_START_URL = 'http://www.quinacrine.com/index.html';
>
> Only URLs below this will be indexed, so either remove the "index.html" or
> set
> @HTTP_LIMIT_URLS = ("http://www.quinacrine.com/");
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>

From Jeramiah.Bowling at afb-a.com Tue Dec 7 20:37:29 2004
From: Jeramiah.Bowling at afb-a.com (Jeramiah M. Bowling)
Date: Tue Dec 7 20:37:38 2004
Subject: [Perlfect-search] Windows 2003 - CGI Misbehaved
Message-ID:

Can anyone help? I have installed perlfect search and indexed successfully
on a Windows 2003 box. I have allowed perl extensions, CGI extensions, and
ISAPI extensions. I have given the anonymous user read and execute access at
both the IIS level and the NTFS level. I get the following error when I go
to search.pl through a browser:

The specified CGI application misbehaved by not returning a complete set of
HTTP headers.

I have searched through the forums to no avail. My perl install is off of
the root of c:\, but as I said I installed and indexed with no problem. In
my search.pl I have added the location of my perl on the top line as:

#!c:/perl

*******************************************
Below is my conf.pl file:
*******************************************

# Perlfect Search configuration file
#$rcs = ' $Id: conf.pl,v 1.64 2003/02/24 21:10:16 daniel Exp $ ' ;

# NOTE: Whenever you change one of the options that's marked with [re-index]
# you need to run indexer.pl again to make the change take effect.
########################################################################### ### basic configuration ### You'll have to adapt these values if you didn't use setup.pl # Where do you want the indexer to start on your disk? # ** Note ** : If your files are generated dynamically (e.g. via PHP) # you should set $HTTP_START_URL (see below), otherwise users # will be able to see your pages' source code using the # "highlight matches" link. # [re-index] $DOCUMENT_ROOT = 'c:/inetpub/wwwroot'; # The base url of your site (normally that's the URL which # corresponds to $DOCUMENT_ROOT). $BASE_URL = 'http://afbonline.afb-a.com'; # The url in which Perlfect Search is located (usually somewhere in cgi-bin/). $CGIBIN = '/bin/perlfect/search/'; # The full-path of the directory where Perlfect Search is installed. $INSTALL_DIR = 'c:/inetpub/wwwroot/bin/perlfect/search/'; # Only files with these extensions should be indexed (case-sensitive). # This is only relevant for file system indexing, when you index files via # http you need to set @HTTP_CONTENT_TYPES instead. [re-index] @EXT = ("htm","html","shtml","asp","txt","pdf","doc","xls"); # If you do not have telnet/ssh access to the server that runs the script, you # need to execute indexer.pl via CGI. Of course not everybody should be able # to do that, so set a password with this option. # ** Note ** : Only use this if absolutely necessary! Setting to "" disables # execution as a CGI, which is much more secure. Note that other people on # your server can probably read this file and look up your password. $INDEXER_CGI_PASSWORD = ""; ########################################################################### ### http configuration ### You only need this if you want to index your pages via http # Where you want the indexer to start via http. Leave empty if # you want to index the files in the filesystem ($DOCUMENT_ROOT). # ** WARNING **: Do not use for foreign servers! It might use too many # resources on other people's servers. 
[re-index] # example: $HTTP_START_URL = 'http://localhost/'; $HTTP_START_URL = ''; # The indexer might not notice if it runs into an endless loop. To void # that, set this to the maximum number of pages that will be visited # (this can be bigger than the number of pages indexed). [re-index] $HTTP_MAX_PAGES = 100; # The web server's document root. Normally that's the same as $DOCUMENT_ROOT, # it differs if you're only using Perlfect Search on a subdirectory. [re-index] $HTTP_SERVER_ROOT = $DOCUMENT_ROOT; # Limit crawling to these URL pattern. This is an important setting so # the script doesn't run out of control. # ** WARNING **: The default ($HTTP_START_URL) should not be changed, # otherwise you risk the script to crawl on remote servers. For example, # the robots.txt file will only be used on the $HTTP_START_URL server! # [re-index] @HTTP_LIMIT_URLS = ($HTTP_START_URL); # Comment this out if you want to ignore robots.txt (only do that if # you really know what you are doing): $ROBOT_AGENT = 'perlfectsearch'; # Should the indexer follow links that are commented out? $HTTP_FOLLOW_COMMENT_LINKS = 1; # Only if indexing via http: the content types to index. # Add 'application/msword' for for MS-Word, # 'application/pdf' for PDF. [re-index] @HTTP_CONTENT_TYPES = ('text/html', 'text/plain'); # Set to 1 to get verbose output during indexing. [re-index] $HTTP_DEBUG = 1; ########################################################################### ### advanced configuration ### You only need this if you want to adapt advanced features # Programs that convert other formats to ascii text. # The name of the file to be filtered is passed as FILENAME, and the command # must print out ascii (or latin1) text. # pdftotext is part of xpdf, available at # http://www.foolabs.com/xpdf/download.html # antiword is available at http://www.winfield.demon.nl/ # NOTE: You also have to set @EXT or @HTTP_CONTENT_TYPES accordingly. 
# If there's a problem with pdftotext, try a new version or hand over # the -raw option to pdftotext. # [re-index] %EXT_FILTER = ( "pdf" => "/usr/bin/pdftotext FILENAME -", "doc" => "/usr/bin/antiword FILENAME" ); # How many results should be shown per page. $RESULTS_PER_PAGE = 5; # Limit the number of results. 0 = no limit. $MAX_RESULTS = 0; # Enable the "highlight matches" feature that displays the original # pages, but with the search terms highlighted. See the README on # restrictions of this feature. $HIGHLIGHT_MATCHES = 1; # A "highlight matches" link does only work for HTML files, so only # offer such a link for files with these suffixes. # ** Note **: If $HTTP_START_URL is not set, the highlighting # will load the file from disk so that the user might find # passwords in the highlightes file! So don't set to include # dynamic file, unless you are using $HTTP_START_URL. @HIGHLIGHT_EXT = ("html", "htm"); # Perlfect Search can highlight the search terms in the matching # document. These are the colors that will be used for the background # of the terms (the browser must support CSS for this). If the last color # is used, the first one will be used again if there are still terms left. @HIGHLIGHT_COLORS = ('#4fafea', '#e5b547', '#aaaaaa', '#ee77ee'); # Show the ranking in percent, with the first document = 100%. $PERCENTAGE_RANKING = 1; # Do you want to index numbers? If so set $INDEX_NUMBERS to 1. [re-index] $INDEX_NUMBERS = 0; # If you don't have enough memory, set this to 1. This will slow down # indexer.pl by a factor of about 2. Searching is not affected. $LOW_MEMORY_INDEX = 1; # How much of the document should be put in the index? With this option, # the context of the match is shown on the results page. This only works # if the match was in the first $CONTEXT_SIZE bytes of the document. # Warning: Using this option will generate a very big index file. # Set to 0 to disable, set to -1 for no limit. 
[re-index]
$CONTEXT_SIZE = 0;

# If $CONTEXT_SIZE is enabled, how many occurrences of every term should be shown
# on the results page?
$CONTEXT_EXAMPLES = 2;

# If $CONTEXT_SIZE is enabled, how many words should be used to show the context
# of a term?
$CONTEXT_DESC_WORDS = 12;

# How many words should be used from the <body> of an html document as a
# description for the document in case there is no <meta description> tag
# available and $CONTEXT_SIZE is 0. [re-index]
$DESC_WORDS = 25;

# The minimum length of a word. Any word of smaller size is not indexed.
# [re-index]
$MINLENGTH = 3;

# If you have umlauts or accents etc. in your text, enable this.
# With this option accented characters will be indexed as the characters
# they are based on (e.g. é -> e, ü -> u), without this option they will
# be filtered out completely (you don't want that). [re-index]
$SPECIAL_CHARACTERS = 1;

# The largest acceptable word size. Reducing this saves space but decreases
# result accuracy. Setting the variable to 0 ignores stemming altogether.
# [re-index]
$STEMCHARS = 0;

# Add URLs to the index, so one can search for them? Note that special
# characters will be ignored, just as in normal text. [re-index]
$INDEX_URLS = 0;

# You can completely ignore certain parts of your documents if you put these
# HTML comments around them. [re-index]
$IGNORE_TEXT_START = '';
$IGNORE_TEXT_END = '';

# The maximum length of <title> elements, everything longer than this
# will be cut off. [re-index]
$MAX_TITLE_LENGTH = 80;

# How much more important are words found in the title, in the meta values
# (author, description, keywords), and in the headlines compared to normal
# text in the body? This influences the ranking of the results.
# Use any integer (0 = ignore that text completely) [re-index] $TITLE_WEIGHT = 5; $META_WEIGHT = 5; $H_WEIGHT{'1'} = 5; # headline <h1>...</h1> $H_WEIGHT{'2'} = 4; $H_WEIGHT{'3'} = 3; $H_WEIGHT{'4'} = 1; $H_WEIGHT{'5'} = 1; $H_WEIGHT{'6'} = 1; # headline <h6>...</h6> # If you want to log the queries to an extra file, set this to 1. # Every use of search.pl will then be logged to data/log.txt. That file # has to exist and must be writable for the webserver. The line format is: # REMOTE_HOST;date;terms;matches;current page;(time to search in seconds); # NOTE: You'll have to comment in two lines at the top of search.pl to get the # time value (see the comment there). # NOTE: if you have many queries, this file will grow quite fast. $LOG = 0; # This will increase the score of results that contain more than one of # the searched terms. Queries with only one term will not be affected. # The number given here is a factor that multiplies the score (even # several times, if there are more than two terms). 0 turns it off. $MULTIPLE_MATCH_BOOST = 0; # Date format for the result page. %Y = year, %m = month, %d = day, # %H = hour, %M = minute, %S = second. On a Unix system use # 'man strftime' to get a list of all possible options. $DATE_FORMAT = "%Y-%m-%d"; # Date format for the "Latest Index update" information on the result page. $INDEX_DATE_FORMAT = "%Y-%m-%d %H:%M"; # Directory with templates (normally you don't have to modify this). $TEMPLATE_DIR = $INSTALL_DIR.'templates/'; # What's the default language. This is the language that's used if no lang # parameter is passed to the script or if the parameter is invalid. $DEFAULT_LANG = 'en'; # The result templates for several languages. 
$SEARCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'search.html';
$SEARCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'search_de.html';
$SEARCH_TEMPLATE{'fr'} = $TEMPLATE_DIR.'search_fr.html';
$SEARCH_TEMPLATE{'it'} = $TEMPLATE_DIR.'search_it.html';
$NO_MATCH_TEMPLATE{'en'} = $TEMPLATE_DIR.'no_match.html';
$NO_MATCH_TEMPLATE{'de'} = $TEMPLATE_DIR.'no_match_de.html';
$NO_MATCH_TEMPLATE{'fr'} = $TEMPLATE_DIR.'no_match_fr.html';
$NO_MATCH_TEMPLATE{'it'} = $TEMPLATE_DIR.'no_match_it.html';

# This is the template for using search.pl via command line:
$SEARCH_TEMPLATE{'text'} = $TEMPLATE_DIR.'search.txt';
$NO_MATCH_TEMPLATE{'text'} = $TEMPLATE_DIR.'no_match.txt';

# This is the template for using the test cases (development only):
$SEARCH_TEMPLATE{'qa'} = $INSTALL_DIR.'qa/search_qa.txt';
$NO_MATCH_TEMPLATE{'qa'} = $INSTALL_DIR.'qa/no_match_qa.txt';

# The text for the "Next Page" link in several languages.
$NEXT_PAGE{'en'} = 'Next';
$NEXT_PAGE{'de'} = 'nächste Seite';
$NEXT_PAGE{'fr'} = 'Suivant';
$NEXT_PAGE{'it'} = 'Successiva';

# The text for the "Previous Page" link in several languages.
$PREV_PAGE{'en'} = 'Previous';
$PREV_PAGE{'de'} = 'vorige Seite';
$PREV_PAGE{'fr'} = 'Précédent';
# The posted file assigned this value to $NEXT_PAGE{'it'}, which would
# overwrite 'Successiva' above; $PREV_PAGE{'it'} is what is meant here.
$PREV_PAGE{'it'} = 'Precedente';

# Text of the link that shows a colored background for matched terms:
$HIGHLIGHT_TERMS{'en'} = 'highlight matches';
$HIGHLIGHT_TERMS{'de'} = 'Treffer hervorheben';

# The text for the "too common" warning. <WORDS> will be replaced with
# a list of the ignored words. If there are no ignored words, this text
# will not appear.
$IGNORED_WORDS{'en'} = '<p>The following words are either too short or very common and were not included in your search: <strong><WORDS></strong></p>';
$IGNORED_WORDS{'de'} = '<p>Folgende Wörter sind zu kurz oder kommen sehr häufig vor und wurden daher in Ihrer Suchanfrage ignoriert: <strong><WORDS></strong></p>';
# fixme: "too short" missing:
$IGNORED_WORDS{'fr'} = '<p>Les mots suivants sont très courants et n\'ont pas été
inclus dans votre recherche: <strong><WORDS></strong></p>'; # fixme: "too short" missing: $IGNORED_WORDS{'it'} = '<p>Le seguenti parole sono molto comuni e non saranno incluse nella vostra ricerca: <strong><WORDS></strong></p>'; ########################################################################### ### You shouldn't have to edit anything below this line. # Various paths (do NOT use system-wide /tmp for security reasons!) $TMP_DIR = $INSTALL_DIR.'temp/'; $DATA_DIR = $INSTALL_DIR.'data/'; $CONF_DIR = $INSTALL_DIR."conf/"; $STOPWORDS_FILE = $CONF_DIR.'stopwords.txt'; $NO_INDEX_FILE = $CONF_DIR.'no_index.txt'; $LOGFILE = $DATA_DIR.'log.txt'; $SEARCH = 'search.pl'; $SEARCH_URL = $BIN.$SEARCH; $UPDATE_FILE = $DATA_DIR.'update'; # Paths to the database files. $INV_INDEX_DB_FILE = $DATA_DIR.'inv_index'; $DOCS_DB_FILE = $DATA_DIR.'docs'; $URLS_DB_FILE = $DATA_DIR.'urls'; $SIZES_DB_FILE = $DATA_DIR.'sizes'; $TERMS_DB_FILE = $DATA_DIR.'terms'; $DF_DB_FILE = $DATA_DIR.'df'; $TF_DB_FILE = $DATA_DIR.'tf'; $CONTENT_DB_FILE = $DATA_DIR.'content'; $DESC_DB_FILE = $DATA_DIR.'desc'; $TITLES_DB_FILE = $DATA_DIR.'titles'; $DATES_DB_FILE = $DATA_DIR.'dates'; # Paths to the temporary database files. $INV_INDEX_TMP_DB_FILE = $DATA_DIR.'inv_index_tmp'; $DOCS_TMP_DB_FILE = $DATA_DIR.'docs_tmp'; $URLS_TMP_DB_FILE = $DATA_DIR.'urls_tmp'; $SIZES_TMP_DB_FILE = $DATA_DIR.'sizes_tmp'; $TERMS_TMP_DB_FILE = $DATA_DIR.'terms_tmp'; $CONTENT_TMP_DB_FILE = $DATA_DIR.'content_tmp'; $DESC_TMP_DB_FILE = $DATA_DIR.'desc_tmp'; $TITLES_TMP_DB_FILE = $DATA_DIR.'titles_tmp'; $DATES_TMP_DB_FILE = $DATA_DIR.'dates_tmp'; # Official version number. 
$VERSION = "3.31b";

1;

Thanks in advance,
Jeramiah Bowling

From daniel.naber at t-online.de Tue Dec 7 20:45:15 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Tue Dec 7 20:44:12 2004
Subject: [Perlfect-search] Windows 2003 - CGI Misbehaved
In-Reply-To: <EBEAFF3FACCCA5429D8AC3FC556C972841BDA9@afbexchange.afb-a.local>
References: <EBEAFF3FACCCA5429D8AC3FC556C972841BDA9@afbexchange.afb-a.local>
Message-ID: <200412072145.16008@danielnaber.de>

On Tuesday 07 December 2004 21:37, Jeramiah M. Bowling wrote:
> I get the following error when I go to search.pl through a browser:
>
> The specified CGI application misbehaved by not returning a complete set
> of HTTP headers.

Uncomment this line in search.pl:

#use CGI::Carp qw(fatalsToBrowser);

You should get a better error message then. If that doesn't help, have a
look at the server's error log.

Regards
 Daniel

--
http://www.danielnaber.de

From david.wessel at gwi-ag.com Mon Dec 13 13:40:00 2004
From: david.wessel at gwi-ag.com (David Wessel)
Date: Mon Dec 13 13:40:23 2004
Subject: [Perlfect-search] moving index from nt to unix
Message-ID: <53811A2271D0D41194120002B31FD38B051D878D@exch01trier.trier.gwi>

Hello,

I want to move the index from my NT machine to a Linux box, but search.pl
fails to open it there. Why would I want to do this? I am indexing a
ClearCase VOB that cannot be accessed from Linux, so I need to do the
indexing on the NT machine. Is there some other way of doing this?

search.pl says:

Cannot open /usr/local/httpd/htdocs/search/data/inv_index: at search.pl line 76.

(the file exists and the permissions are o.k.)
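One possible explanation, which this thread does not confirm, is that the database files written on the NT machine use an on-disk format that the DB_File build on the Linux box cannot read. If so, a portable workaround is to dump each database to plain text on the source machine and rebuild it on the target. The sketch below shows only the dump and load steps on ordinary hashes; `dump_hash` and `load_hash` are hypothetical helpers, and hex encoding is used so that binary keys and values survive the transfer.

```perl
use strict;
use warnings;

# Write every key/value pair as one line of hex, so binary data survives.
sub dump_hash {
    my ($hash, $fh) = @_;
    while (my ($key, $value) = each %$hash) {
        print {$fh} unpack('H*', $key), ' ', unpack('H*', $value), "\n";
    }
}

# Read lines written by dump_hash back into a hash.
sub load_hash {
    my ($fh, $hash) = @_;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split / /, $line, 2;
        $value = '' unless defined $value;
        $hash->{ pack('H*', $key) } = pack('H*', $value);
    }
}

# Round trip through an in-memory file to show that binary values survive.
my %original = (term => pack('N*', 1, 2, 3));
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open buffer: $!";
dump_hash(\%original, $out);
close $out;
open(my $in, '<', \$buffer) or die "cannot reopen buffer: $!";
load_hash($in, \my %copy);
close $in;
print "round trip ok\n" if $copy{term} eq $original{term};
```

In real use the hash passed to `dump_hash` would be tied to a database file such as data/inv_index on the NT machine, and the hash passed to `load_hash` would be tied to a new file on the Linux box, each with the same tie arguments that the Perlfect scripts themselves use.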
Thanks for any help,
David

--------------------------------------------
David Wessel
Auszubildender Fachinformatik - Anwendungsentwicklung
GWI Research GmbH
Fachgruppe Kommunikation / Schnittstellen
Monaiser Straße 11, 54294 Trier
Tel.: 0651 / 8247 - 0
Fax.: 0651 / 8247 - 100
GWI SST - Hotline 01805 / 494483
http://www.gwi-ag.com
david.wessel@gwi-ag.com
--------------------------------------------

From pokerup2001 at yahoo.com Tue Dec 14 13:53:21 2004
From: pokerup2001 at yahoo.com (david groeling)
Date: Tue Dec 14 13:53:24 2004
Subject: [Perlfect-search] Start-Url Problems
Message-ID: <20041214135321.33257.qmail@web14716.mail.yahoo.com>

I'm having a heck of a time getting HTTP search to work. I have read the documentation over and over and have tried numerous, if not all, of the variable settings that I can imagine. Here are the current and most accurate settings that I use. With these settings I can do everything but index from the web. I would like to index from the web because I use active pages such as .php content.

$DOCUMENT_ROOT = '/www/Apache2/htdocs/';
$BASE_URL = 'https://your-domain.org:8082/';

I know this is the setting that sets the actual link specification from your search page. All my links point to the proper page and server protocol, HTTPS.

$CGIBIN = '/cgi-bin/perlfect/search/';
$INSTALL_DIR = '/www/apache2/cgi-bin/perlfect/search/';
@EXT = ("htm","html","shtml","xml","php");
$INDEXER_CGI_PASSWORD = "MY-PASSWORD";

Now, in order for all this to work properly I cannot use the HTTP start URL, which I really would like to. As stated, I have tried every possible variable I can think of. The best results I have gotten when I use the HTTP start URL variable are 1 page scanned, 0 pages indexed, 0 content indexed.

$HTTP_START_URL = '';

Are there any problems with this on my side that anyone can see? Are there any problems in the code that prevent this option from working properly?
I have come to like this program. I have actually looked for quite a long time for a good search engine, and I think I have found it here. This is very good work and I must applaud you all for doing such a good job.

Well, Merry Christmas all.

PS: Santa at Perlfect Search, please respond with all positive answers to my questions. Oh yeah, and don't forget to look under the tree; I have left you something there again this year.

From daniel.naber at t-online.de Tue Dec 14 18:22:08 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Tue Dec 14 18:20:57 2004
Subject: [Perlfect-search] Start-Url Problems
In-Reply-To: <20041214135321.33257.qmail@web14716.mail.yahoo.com>
References: <20041214135321.33257.qmail@web14716.mail.yahoo.com>
Message-ID: <200412141922.08835@danielnaber.de>

On Tuesday 14 December 2004 14:53, david groeling wrote:

> $BASE_URL = 'https://your-domain.org:8082/';

Does that page demand a password? HTTPS alone doesn't seem to be a problem: I just managed to index some pages from a public HTTPS server (starting at https://secure.manning.com/login.php). What exactly is the output you get when you try to index?

Regards
 Daniel

-- 
http://www.danielnaber.de

From pokerup2001 at yahoo.com Wed Dec 15 13:56:03 2004
From: pokerup2001 at yahoo.com (david groeling)
Date: Wed Dec 15 13:56:08 2004
Subject: [Perlfect-search] RE: Http start url
Message-ID: <20041215135603.30473.qmail@web14703.mail.yahoo.com>

I can get it to scan, but it scans 1 file and indexes 0 files and 0 content, as stated in my previous message. I have tried various options for this; I've changed all the settings over and over again by trial and error. I got a message from another Perlfect Search subscriber suggesting that I set only the "limit the URL" setting. OK, I've done that and left the start URL blank.
I can get this to scan like that, but it doesn't really seem to work: it scans as if it's on the local file system, not via the web. Let me try to give some more information while explaining what I understand from the install and setup instructions.

The instructions suggest that if you use dynamic pages such as ASP or PHP files, you will not get only the content from those pages; you will also get the code behind them. So the instructions say that if you want to index active files like ASP or PHP, you must use the HTTP start URL setting. I can confirm that if I use the local file system scan, the page code does indeed get indexed. That is not a good option at all, since others would be looking at code on pages rather than content.

So it all comes down to how to index the PHP and ASP pages without getting the code indexed. Like I said, I have read and reread the instructions at least 100 times and tried every available option in the settings to get active pages to index correctly.

After looking at the archives for quite some time over the last 5 days, I see that others have similar problems with the start URL. I think, but am not sure, that a true example of a working setup is needed, one where people can see exactly how the code is set up. I know that there are generic examples on one of the instruction pages, but they are not set up like the conf.pl file: they are separated with white space, and the generic code examples are not working examples.

On to the second part of this. If PHP or ASP pages are not truly indexed, meaning that only the HTML content (or the page code) is pulled from them, I'm not sure that the instructions go into that detail well enough. Many PHP or ASP pages pull their content from databases, XML, or other sources. Is it really possible to index a PHP or ASP page correctly?
On a side note, which may relate to my misinterpretation of how this is supposed to work: say you index files with images on them, and you search for, say, flowers on the search page, knowing that flowers are indexed in your pages. The search page will come up with several pages of results to view. These pages with flowers in them also have images either embedded in them or linked to them. If you view the pages normally, you will see the flower pictures. If you look at the pages through the highlight-text option for "flowers", no images show up. So, just as a PHP or ASP page pulls its data from other sources, so do regular HTML pages if images or other types of content are on those pages, but they will not show up. I have tried to index image files (jpg and gif) to see if that would correct the problem, but it does not.

So that is a fairly good description of what I'm trying to do. Am I wrong in my interpretation of the instructions that it will actually do these things correctly?

From daniel.naber at t-online.de Wed Dec 15 19:15:22 2004
From: daniel.naber at t-online.de (Daniel Naber)
Date: Wed Dec 15 19:14:06 2004
Subject: [Perlfect-search] RE: Http start url
In-Reply-To: <20041215135603.30473.qmail@web14703.mail.yahoo.com>
References: <20041215135603.30473.qmail@web14703.mail.yahoo.com>
Message-ID: <200412152015.22829@danielnaber.de>

On Wednesday 15 December 2004 14:56, david groeling wrote:

> I can get it to scan but it scans 1 file and indexes 0
> files 0 content
> as stated in previous message.

So what is the exact output when indexing with $HTTP_DEBUG = 1? Is it a public server you're trying to index, i.e. can you tell us the URL?
The broken images are a known bug; the fix is here:
http://www.perlfect.com/cgi-bin/cvsweb.cgi/search/search.pl.diff?r1=1.95&r2=1.96&f=h

Regards
 Daniel

-- 
http://www.danielnaber.de
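
For reference, the settings discussed in this thread could be combined in conf.pl roughly as follows. This is only a sketch: the variable names are the ones quoted in the messages above, the host is the placeholder from the original post, and the comments describe the behaviour as the thread reports it, not as verified against the Perlfect Search source.

```perl
# conf.pl fragment (illustrative sketch; values are placeholders, not a tested setup)

# Base for the links shown on the result page, as quoted in the thread.
$BASE_URL = 'https://your-domain.org:8082/';

# Assumption: a non-empty start URL makes the indexer fetch pages over HTTP,
# so PHP/ASP pages are indexed as rendered output rather than as source code.
$HTTP_START_URL = 'https://your-domain.org:8082/';

# Enable HTTP debug output while indexing, as Daniel suggests.
$HTTP_DEBUG = 1;
```

The indexer output produced with $HTTP_DEBUG enabled is what Daniel asks for in his reply, so it is worth posting in full when reporting a "1 page scanned, 0 pages indexed" result.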