Perlfect Solutions  
 
 

Perlfect Search FAQ

Contents

  1. General
  2. Troubleshooting
  3. Usage

Last update: 2003-02-21


General

What is Perlfect Search?

Perlfect Search is a sophisticated, powerful, versatile, customizable and effective site indexing/searching suite available under an open source license (GPL). Features include full-text indexing, meta tag indexing, keyword-document scoring based on weights, directory/file exclusion, stopwords, word length limits, terminal/browser operation, advanced keyword based queries, boolean operators, compact index, fully customizable output layout, multi-page output, results ranked by relevance and an automatic installation and configuration utility.

Who needs Perlfect Search?

Anyone who needs to make their site searchable, that is, to let users type a few keywords in a box and get a list of pages within the site relevant to their query.

Why should I use Perlfect Search instead of some search engine?

Because Perlfect Search works out-of-the-box and can still be adapted to your needs easily. The indexing and the search script together consist of just over 1400 lines of documented Perl source code. That's not much for a search engine and it lets you make your own modifications without much hassle.

How much does it cost?

Nothing. It is free, open source software under the terms of the GNU General Public License.

How many pages can it handle?

The latest version of Perlfect Search (v3.31) is capable of indexing sites with 2,000+ documents easily. There's a limit of 65,535 files, which can be worked around if necessary.

The more pages you have, the more memory indexing will need. This is sometimes a problem for people whose webspace provider doesn't allow scripts to use much memory. You might get an "out of memory" error or the indexer.pl script might just stop before it finishes. In this case, you should talk to your webspace provider: it is not a bug in Perlfect Search and there's no simple way to "optimize" its memory usage without other drawbacks (the only case that can still be optimized is the indexing of very large files). With some webspace providers CPU usage is a problem, too. If the indexer.pl script gets killed because it uses too much CPU time (i.e. it runs for too long), it's not a bug in Perlfect Search either and you'll have to contact your webspace provider.

The speed of searching is always very fast even for very large sites. In general, the performance of version 3.31 is more than enough for all but the most demanding sites.

How does it work?

It uses Berkeley DB, a lightweight embedded key/value database. There are several 'tables', each consisting of key/value pairs. Each document has an ID, which is used to look up the URL, title and description of that document. One 'table' holds all terms and records which documents each term occurs in; this 'table' is what makes searching very fast. You can find out more at our developer page. Also note that the application is published as Open Source, so you can have a look at the internals of the program yourself.
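
As a rough illustration of this layout, the sketch below uses ordinary Perl hashes in place of Berkeley DB files. The table names (%doc_url, %doc_title, %term_docs), the lookup function and the sample data are all hypothetical; the real index format may well differ.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the 'tables'. In a real index each hash could be
# tied to a Berkeley DB file (e.g. via DB_File) instead of living in memory.
my %doc_url   = (1 => '/index.html', 2 => '/faq.html');
my %doc_title = (1 => 'Home',        2 => 'FAQ');

# Term table: each term maps to the IDs of the documents that contain it.
my %term_docs = (
    search => [1, 2],
    perl   => [2],
);

# Look up a term and return the URLs of the matching documents.
sub lookup {
    my ($term) = @_;
    my $ids = $term_docs{lc $term} || [];
    return map { $doc_url{$_} } @$ids;
}

print join(', ', lookup('search')), "\n";
```

Because a search only needs one hash lookup per term, its cost does not depend on scanning the documents themselves.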

What platforms can it be installed on?

It has been tested and is known to work on Linux and Windows. We expect that it should work on any UNIX variant that can run the perl interpreter.

Are there any special requirements?

You need to have Perl installed on the server's system and Perl's DB_File module. Depending on the way you index your pages, you might also need some other modules (see the README file). You also need the right to install your own CGI scripts on your server.

Can I index pages that are generated dynamically by PHP/JSP/Perl etc?

Yes, just set $HTTP_START_URL in conf.pl to the URL where crawling should start. Note that you need some Perl modules, as described in the README file. Also read the next question.

Can I index remote servers? That's what indexing via http is for, right?

No, you should not use Perlfect Search on remote servers that are not yours. The admins of such servers would be very annoyed, because Perlfect Search will waste their bandwidth. Indexing via http is useful if your own pages are generated dynamically (e.g. by PHP).

Can I index files locally and then upload the data files to the server?

Only if you are lucky. There's no guarantee that the index files (i.e. the files in the data directory) can be used on a different machine. If you want to try, make sure to use binary mode when uploading them via FTP.

I need feature foobar!

First have a look at conf.pl; it contains many options that are not described in the README or in this FAQ. The feature you're looking for may also be in development already but not yet in the latest release, so have a look at the patches page. Finally, you can always ask for our professional service: just mail us at support@perlfect.com.

Are there any known problems?

Currently we know of these problems:
  • It does not always work with charsets that are not Latin-1 (ISO-8859-1) or plain 7-bit ASCII. You'll have to test whether it works for you. Unicode is not supported yet.
  • For large documents (more than a few hundred kB) Perlfect Search needs a lot of memory.
  • The order of results with the same score is not guaranteed to be the same every time. So if you reload a page where at least two documents have the same score, their order might change. (This is actually hardly a problem or bug and just listed for completeness.)
If you experience a different problem, make sure to update to the latest version.

Is it secure?

We're sure it is, if installed correctly. Make sure that indexer.pl is only executable by yourself (and not as a CGI). See the comments in search.pl about how to make it work under Perl's tainting mechanism.

If you're paranoid about DoS attacks, you can set a maximum execution time. Add something like this at the beginning of search.pl:
      $SIG{ALRM} = sub { die "stopped by SIGALRM"; };
      alarm 5;
    
This will kill the search after 5 seconds if it has not yet returned results.

How do you calculate the results' scores?

This is the formula:

score = occurrences of word in this document * log (number of documents / number of documents containing this word)

So the formula takes into account how often the word appears in the document, but also whether it appears in many other documents. If so, it's less important. If the word appears in all documents, the document will be found, but its score will be 0.
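
The formula can be sketched in a few lines of Perl. This is only an illustration: the function name and the sample numbers are made up, and since the FAQ does not state the base of the logarithm, the natural logarithm (Perl's built-in log) is an assumption.

```perl
use strict;
use warnings;

# score = occurrences * log(total documents / documents containing the word)
# Assumption: natural logarithm, as provided by Perl's log().
sub score {
    my ($occurrences, $total_docs, $docs_with_word) = @_;
    return 0 unless $docs_with_word;    # word does not occur anywhere
    return $occurrences * log($total_docs / $docs_with_word);
}

printf "%.2f\n", score(3, 100, 10);     # word found in 10 of 100 documents
printf "%.2f\n", score(3, 100, 100);    # word found in every document
```

With these sample numbers the first call gives 3 * log(10), while the second gives 0 because log(1) is 0, which matches the behaviour described above.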


Troubleshooting

I can't open the .tar.gz file!

On a Linux system use the command tar xvfz filename.tar.gz to extract the contents of the archive. On a Unix system without the GNU version of the tar program, use gunzip filename.tar.gz to unzip it first, and then tar xvf filename.tar to extract it. On Windows, WinZip can open tar.gz archives. If yours can't, you probably have an old version. The latest one is free for download from winzip.com. Also, on Windows the default name for saving the downloaded file sometimes 'magically' takes a .tar extension instead of .tar.gz, which confuses WinZip. So make sure the extension is .tar.gz before trying to open the archive.

How do I install it?

There is a README file in the archive you downloaded. It contains instructions for your system. Usually all you have to do after extracting the archive is to run the command perl setup.pl at the telnet/ssh prompt, in the directory on your host machine where you extracted the archive. This will start the auto-configuration utility. Then you just follow the instructions.

I don't have ssh/telnet access to my host's machine.

You'll have to resort to manual configuration. The README file that came with the distribution you downloaded has very brief instructions on how to install manually. Manual installation should be easy if you're slightly familiar with installing CGI scripts in general, but it is not recommended if you don't know what you're doing. Often problems are not specific to Perlfect Search, so you should read the list of common CGI problems. If you get stuck you can always use our professional installation service.

In various places you refer to a "full path". What's that?

That is the absolute path, from the root of the filesystem. To find out the full path to a directory on UNIX, cd into it and type the command pwd (print working directory), which prints the full path of the current directory. On Windows, typing cd without arguments at the command prompt does the same. Mind you, on Windows you need to include the drive letter, such as C:, but you do not need to use backslashes. Instead, write the path like C:/cgi/perlfect/.

I am running on Windows, and the script uses / instead of \ in paths.

Don't worry. Perl accepts both forward slashes (/) and backslashes (\) in file paths on Windows. You can even mix both types and it will still work, but we recommend using forward slashes (/). Just make sure you use the drive letter in all absolute paths on Windows.

I have trouble installing / using Perlfect Search under Windows.

That may be caused by a lack of the DB_File module. See the next question.

How do I install DB_File?

  • On Windows: just type ppm install DB_File
  • On Unix: just type perl -MCPAN -e 'install DB_File' (as root). If that doesn't work, download DB_File, untar the archive and follow the instructions in the README file.

It still doesn't work.

If you followed the instructions above and it still doesn't work, try this:
  • Make sure that search.pl is executable by the web server. On Linux/Unix, this usually means setting permissions of search.pl to 0755 (-rwxr-xr-x) with this command: chmod 0755 search.pl
  • Make sure that Perl5 is not only installed, but that it is used. On Linux/Unix, check the very first line of both indexer.pl and search.pl. It should point to the correct version of Perl on your system.
  • If search.pl doesn't work, try renaming it to search.cgi.
  • When using FTP, you need to upload all files in ASCII mode (not binary). When using scp to copy the files, make sure the line endings are correct: Windows and Unix use different line endings, so use an editor that lets you choose which ones to write or, even better, edit the files directly on the server.
  • Check out a list of Common CGI Problems.
  • Uncomment this line in search.pl (i.e. remove the #):
    #use CGI::Carp qw(fatalsToBrowser);
    You should now get a more detailed error message when searching.
  • If you still get an "Internal Server Error", look for the web server's error log. Often it's called error_log or error.log. It will contain a more detailed description of the problem.

I get 'division by zero' when I run indexer.pl or other strange things happen, like 'sdbm store returned -1, errno 22'.

Try again after doing this:
  1. Remove all files in the data directory
  2. Install DB_File (see a previous question for instructions)
This shouldn't happen with the latest version, so please update.

I always get "Can't 'next' outside a block at indexer_filesystem.pl line 33".

This bug exists only in version 3.20. In indexer_filesystem.pl you need to replace next by return in lines 33 and 34.

It doesn't index all of my files.

If you are indexing your local filesystem check the values of @EXT in conf.pl and the entries in conf/no_index.txt. If you make changes to those values, you will have to run indexer.pl again. If you are indexing via http, check the following list:
  • Remember that the script will only find files that are directly or indirectly linked from your start page ($HTTP_START_URL).
  • If you have pages which are only accessible with Cookies, Java, Javascript, Flash etc enabled, these will not be indexed either. This is not a bug in Perlfect Search but a general problem that other search engines have too. You'll have to rework your pages to also work without Cookies/Java/Javascript/Flash.
  • Framesets need a proper <noframes> section with correct links.
  • Perlfect Search only follows links, it does not try to submit forms. So every automatically generated page that should be indexed needs to be accessible via a common link.
  • If all that does not help, check the values of @HTTP_CONTENT_TYPES and $HTTP_LIMIT_URL and turn on $HTTP_DEBUG to get more debugging output during indexing.

It doesn't index my PDF or MS-Word files.

First add "pdf" (and "PDF" if necessary) to @EXT in conf.pl. If you are indexing via http, you also have to add application/pdf to @HTTP_CONTENT_TYPES. The same applies to MS-Word files, whose content type is application/msword. In any case you have to set %EXT_FILTER correctly.
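
As a sketch, the changes could look like the fragment below. The starting values of @EXT and @HTTP_CONTENT_TYPES are placeholders, not the shipped defaults, and "doc" as the MS-Word extension is an assumption. In conf.pl itself you would normally edit the existing lists in place rather than use push.

```perl
use strict;
use warnings;

# Placeholder starting values; check your own conf.pl for the real ones.
our @EXT                = ('html', 'htm');
our @HTTP_CONTENT_TYPES = ('text/html');

# PDF files
push @EXT, 'pdf', 'PDF';
push @HTTP_CONTENT_TYPES, 'application/pdf';

# MS-Word files ('doc' as the file extension is an assumption)
push @EXT, 'doc';
push @HTTP_CONTENT_TYPES, 'application/msword';

# %EXT_FILTER also has to be set. Its format is not shown here;
# see the comments in conf.pl.
print "Extensions:    @EXT\n";
print "Content types: @HTTP_CONTENT_TYPES\n";
```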

indexer.pl just stops before it finishes.

See the answer to the question "How many pages can it handle?".


Usage

How do I run the indexer?

Just cd into the directory where the indexer.pl file is installed (that should be perlfect/search/ in your cgi-bin directory if you used the setup utility) and type ./indexer.pl. That should do the trick.

How frequently should I run the indexer?

Search results will depend on what the indexer found in your site the last time it was run. So, each time you make changes to your site's content (move around files, alter the text, descriptions, keywords or titles of documents, or add/remove documents) you have to run the indexer again so that the index that facilitates the searches reflects the updated version of your site. You can use a scheduler to run the indexer overnight if you make frequent changes to your site. A common scheduler in UNIX is cron. You can find out how to use it by issuing the command man cron at the command prompt, to retrieve its manual page.

Do I have to wait for the indexer to finish?

If indexing takes very long, call it this way (on Unix):
nohup ./indexer.pl >indexer.out 2>&1 &
Indexing will then happen in the background and you can do other things or even log out. The file indexer.out will contain logging information and possible errors. You can view this file while it is being written with tail -f indexer.out. If you are starting the indexer via the HTML form index_form.html, you should wait until it finishes: don't quit your browser and don't click the browser's stop button.

How do I make the little box where the users type in keywords to make a search?

The README documentation explains that in detail.

Do I use POST or GET to call the script?

It works with both. Normally GET is a wise choice, since it allows the user to bookmark the search results page. It also puts the search terms in the referrer URL, so when users click on a result to go to one of your pages, your referrer log shows what they searched for. If you expect users to make very long queries (more than 200 characters), you're better off using POST.

Do I have to make any modifications to my HTML documents in order to use Perlfect Search?

See the README for some HTML restrictions with the "highlight matches" feature. Besides that, you can optimize your documents for Perlfect Search. These optimizations are the same ones that make sense for all the big search engines like Google etc:
  • Use sensible <title>s
  • Use <meta> keywords and descriptions, but put only relevant words there
  • Use headlines (<h1> etc), don't just make the font big and bold
  • Use alt values for all your images
  • Your HTML documents should have correct syntax

Can I call it as an SSI?

No, since version 3.0, calling the script as an SSI will not give you the search box as it did in earlier versions. There's no need for that and it's a waste of system resources to call a script just to get a tiny snippet of HTML.

How do I exclude directories/files from indexing?

Inside the installation directory there is a directory called conf and inside it there's a file called no_index.txt containing the paths to all directories or files that are excluded from the indexing. Add the directories/files you want, and run the indexer again. You can use the * wildcard with the usual meaning it has in Linux/Unix or DOS/Windows, i.e. to exclude a complete directory use something like /usr/local/httpd/htdocs/private/*

If you have several files with the same name that should be excluded no matter which directory they are in, use something like /usr/local/httpd/htdocs/*private_file.html or /usr/local/httpd/htdocs/*private_directory/* (assuming $DOCUMENT_ROOT = '/usr/local/httpd/htdocs').
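
To get a feeling for how such patterns behave, here is a minimal sketch of wildcard matching. It is not the indexer's own code: the function names are hypothetical, and it assumes that * matches any run of characters, including slashes, as the examples above suggest.

```perl
use strict;
use warnings;

# Turn a no_index.txt style pattern into a regex: '*' matches any run of
# characters, everything else is matched literally.
sub pattern_to_regex {
    my ($pattern) = @_;
    my $re = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$re\z/;
}

# Return 1 if the path matches at least one of the patterns, 0 otherwise.
sub is_excluded {
    my ($path, @patterns) = @_;
    for my $pattern (@patterns) {
        my $re = pattern_to_regex($pattern);
        return 1 if $path =~ $re;
    }
    return 0;
}

my @patterns = (
    '/usr/local/httpd/htdocs/private/*',
    '/usr/local/httpd/htdocs/*private_file.html',
);

for my $path ('/usr/local/httpd/htdocs/private/a.html',
              '/usr/local/httpd/htdocs/docs/private_file.html',
              '/usr/local/httpd/htdocs/index.html') {
    printf "%-50s %s\n", $path, is_excluded($path, @patterns) ? 'excluded' : 'indexed';
}
```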

If you are using $HTTP_START_URL, meta tags for robots in the HTML files will be checked, as will the server-wide /robots.txt. Remember that Perlfect Search should only be used to index your own server, not other people's servers!

How do I prevent the indexer from indexing specific words?

There's a file called stopwords.txt in the conf directory inside the installation directory. The default file that comes with the distribution contains a list of about 400 English words like 'the', 'they', 'their', etc that are usually ignored by most search engines. You can modify the list as you wish. After changes to stopwords.txt, you have to run indexer.pl again.
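
The effect of a stopword list can be sketched as follows. The word list is a made-up excerpt rather than the contents of the shipped stopwords.txt, and the function name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a stopword list; the real one is conf/stopwords.txt.
my %stopword = map { $_ => 1 } qw(the they their a of);

# Drop stopwords (case-insensitively) from a list of terms before indexing.
sub remove_stopwords {
    my (@terms) = @_;
    return grep { !$stopword{lc $_} } @terms;
}

print join(' ', remove_stopwords(qw(The history of their search engine))), "\n";
```

Since the filtering happens at indexing time, a changed list only takes effect after the next indexer run.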

Can I search for special characters?

Punctuation and symbols such as *!$/ cannot be searched for: those characters are replaced by spaces before they are put into the index. Umlauts, accented characters, etc. work as you would expect: just type the word including the special character. Perlfect Search indexes umlauts etc. as the characters they are based on, e.g.
é is indexed as e
ö is indexed as o
ß is indexed as ss
This means that a search for Krüger will match Krüger and Kruger, a search for Kruger will give the same results. Note that this will not work if your documents are encoded with the Windows charset. Use Latin-1 (ISO-8859-1) or, even better, use entities to encode special characters.
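
A minimal sketch of this kind of normalization, assuming Latin-1 input, is shown below. The mapping table lists only a few characters and is an illustration, not the table used by Perlfect Search; the function name is hypothetical.

```perl
use strict;
use warnings;

# A few Latin-1 characters mapped to the characters they are based on.
# Illustration only; a complete table would cover many more characters.
my %base = (
    "\xE9" => 'e',     # e with acute accent
    "\xF6" => 'o',     # o with umlaut
    "\xFC" => 'u',     # u with umlaut
    "\xDF" => 'ss',    # sharp s
);

sub normalize {
    my ($text) = @_;
    # Replace each accented character that has an entry in the table.
    $text =~ s/([\xC0-\xFF])/exists $base{$1} ? $base{$1} : $1/ge;
    # Replace the special characters *, !, $ and / by spaces.
    $text =~ tr{*!$/}{ };
    return $text;
}

print normalize("Kr\xFCger"), "\n";
print normalize("Stra\xDFe"), "\n";
```

Applying the same function to both the indexed text and the query is what makes Krüger and Kruger find the same documents.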

How can I search for words with hyphens?

A word like CD-ROM is indexed as CDROM, and you will find it no matter if you search for CD-ROM or CDROM.

Can I search for text inside HTML comments?

No. You should use this fact as a feature to "comment out" Javascript and stylesheets, e.g.
      <style type="text/css">
      <!--
      a:hover { ...}		/* this is not in the index */
      -->
      </style>
    

Does it work with mod_perl?

It should work with mod_perl since Perlfect Search version 3.09.

How can I use a PHP page as the result template?

Because Perlfect Search is a CGI script, its output cannot include PHP code in most cases. However, you can write a PHP page which then calls Perlfect Search and prints the result. This is called a wrapper and you can get one at the patches page.