Perlfect Solutions
 

[Perlfect-search] Patch to increase low memory indexing speed

G. Edward Johnson edward_johnson@yahoo.com
Wed, 27 Mar 2002 17:30:09 -0800 (PST)
--0-1735424170-1017279009=:66125
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline



I have been playing with Perlfect Search and have come
up with some changes that significantly speed up
indexing when LOW_MEMORY_INDEX is set.  I think some
of you may be interested in trying this out.

What this patch does:
When LOW_MEMORY_INDEX is set, the db hashes are opened
on disk and a lot of data is written to them.  By
keeping some of that data in memory and flushing it
out periodically, we can improve the speed while only
slightly increasing the amount of memory used.
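
The buffering idea described above can be sketched roughly as follows.  This is a minimal, self-contained illustration, not the patch itself: the hash names and flush_to_disk follow the attached diff, but record_document and $flush_count are hypothetical helpers added for the example, and plain hashes stand in for the on-disk db hashes so that it runs anywhere.

```perl
use strict;
use warnings;

# Stand-ins for the indexer's on-disk databases.  In indexer.pl these are
# disk-backed when LOW_MEMORY_INDEX is set; plain hashes are used here
# (an assumption) so the sketch has no external dependencies.
my (%df_db, %tf_db);

# In-memory buffers that absorb writes between flushes.
my (%df_db_tmp, %tf_db_tmp);

my $FLUSH_FREQUENCY = 2;   # documents to process between flushes
my $flush_count     = 0;   # bookkeeping for this sketch only

# Merge the buffered counts and postings into the "disk" hashes, then
# empty the buffers so that memory use stays bounded.
sub flush_to_disk {
    foreach my $term_id (keys %tf_db_tmp) {
        $df_db{$term_id} = 0  unless defined $df_db{$term_id};
        $df_db{$term_id} += $df_db_tmp{$term_id};
        $tf_db{$term_id} = '' unless defined $tf_db{$term_id};
        $tf_db{$term_id} .= $tf_db_tmp{$term_id};
    }
    %tf_db_tmp = ();
    %df_db_tmp = ();
    $flush_count++;
}

# Hypothetical helper: record one document's term frequencies in the
# buffers and flush on every $FLUSH_FREQUENCY-th document id.
sub record_document {
    my ($doc_id, $tf) = @_;    # $tf: hash ref, term id -> count in this doc
    foreach my $term_id (keys %$tf) {
        $df_db_tmp{$term_id}++;
        $tf_db_tmp{$term_id} = '' unless defined $tf_db_tmp{$term_id};
        $tf_db_tmp{$term_id} .= pack("ww", $doc_id, $tf->{$term_id});
    }
    flush_to_disk() if $doc_id % $FLUSH_FREQUENCY == 0;
}

# Three toy documents: term id 1 appears in all of them, term id 2 in one.
record_document(1, { 1 => 3 });
record_document(2, { 1 => 1, 2 => 5 });
record_document(3, { 1 => 2 });
flush_to_disk();    # final flush picks up document 3

print "term 1 appears in $df_db{1} documents\n";
print "term 2 appears in $df_db{2} documents\n";
```

Each term costs one slow write per flush instead of one per document, which is presumably where the speedup comes from.  The final flush_to_disk call matters: without it, anything buffered since the last periodic flush would be lost.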

How much does it help?
I was able to get a speed increase of more than 33%
with conservative settings.  I believe that in certain
cases, it could double the speed of indexing.

I did a test where I indexed 991 8K HTML files, for a
total of about 8MB, with flushing every 100 documents.
This gave me a 37% speed increase while keeping the
memory used under 8MB.

Another test I did was approx. 13,000 8K documents
with flushing every 1000 documents.  This cut the
indexing time in half, while memory was under 20MB. 
This is probably a best case, because these documents
are fairly small and there is a bit of repetition
between files.

If you try the patch, let us know how it works and if
you find any problems with it.  The patch was generated
against version 1.68 of indexer.pl that I got from CVS,
but I also checked that it applies correctly to the
current 1.70 version.


How to Use it:
I have only tried it on Linux, so results on other
systems may vary.  Also, I have no idea how to apply a
patch on a Windows system.

First, the patch only has an effect when
$LOW_MEMORY_INDEX = 1; is set in conf.pl.

Second, add the following to conf.pl:

# If you have low memory, you can reduce the slowdown by
# only flushing the databases periodically.  This variable
# controls how many documents to process between flushes.
# The higher the number, the better the performance, but
# the more memory it will take.
$FLUSH_FREQUENCY = 100;
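
To get a feel for what a given setting means, here is a small sketch of when flushes would occur.  It mirrors the check and the fallback in the attached diff (flush when the document id is a multiple of $FLUSH_FREQUENCY, and use 1 when the variable is unset).  It assumes document ids run 1, 2, 3, ... in order, and flush_points is a hypothetical helper for illustration, not part of indexer.pl.

```perl
use strict;
use warnings;

our $FLUSH_FREQUENCY;    # would normally be set in conf.pl; left unset here

# Hypothetical helper: list the document ids (1..$doc_total) after which
# a periodic flush would be triggered for the given frequency.
sub flush_points {
    my ($doc_total, $frequency) = @_;
    $frequency = 1 unless $frequency;   # same fallback as in the diff
    return grep { $_ % $frequency == 0 } 1 .. $doc_total;
}

# Unset frequency: falls back to 1, i.e. a flush after every document.
my @unset = flush_points(5, $FLUSH_FREQUENCY);

# The 991-document test with flushing every 100 documents.
my @every_100 = flush_points(991, 100);

print "unset:  ", scalar(@unset), " flushes for 5 documents\n";
print "at 100: ", scalar(@every_100), " periodic flushes for 991 documents\n";
```

Under these assumptions the 991-document run flushes nine times during the crawl, plus the one final flush the patch performs when crawling finishes.  If $FLUSH_FREQUENCY is left out of conf.pl, the fallback of 1 means a flush after every document, so the patch would bring little or no gain.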

Third, apply the patch.  Save the attached diff as
indexer.diff in the same directory as indexer.pl, then
run this command:
patch < indexer.diff
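
If you are reading this message in the list archive, the attachment may show up as base64 text rather than as a downloadable file, and it has to be decoded before patch can use it.  A minimal sketch of the decoding step, using the MIME::Base64 module that ships with Perl; the sample string is just the first line of the attachment body from this message, and decode_attachment_text is a hypothetical helper name:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: decode base64 text as it appears in the message.
# decode_base64 skips newlines and other non-base64 characters, so the
# wrapped lines can be passed in as one string.
sub decode_attachment_text {
    my ($encoded_text) = @_;
    return decode_base64($encoded_text);
}

# First line of the base64 attachment body, copied from this message.
my $encoded_line = 'KioqIGluZGV4ZXIucGwJV2VkIE1hciAyNyAxOTozMjozOSAyMDAyCi0tLSBu';

my $decoded = decode_attachment_text($encoded_line);
my ($first_line) = split /\n/, $decoded;
print "$first_line\n";    # the "*** indexer.pl ..." header of the context diff
```

To rebuild the whole diff, pass all of the base64 lines (without the MIME headers and boundary lines) to the same function and write the result to indexer.diff.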

That should be it.  Start testing it.
Like Perlfect Search itself, this patch is provided
under the GPL.

Edward.




=====
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
G. Edward Johnson             -=-  Jabber: lorax@jabber.com
lorax@pobox.com               -=-     AIM: lorax2048
http://EdwardJohnson.com/     -=-   Yahoo: edward_johnson

--0-1735424170-1017279009=:66125
Content-Type: application/octet-stream; name="indexer.diff"
Content-Description: indexer.diff
Content-Disposition: attachment; filename="indexer.diff"
Content-Transfer-Encoding: base64

KioqIGluZGV4ZXIucGwJV2VkIE1hciAyNyAxOTozMjozOSAyMDAyCi0tLSBu
ZXcvaW5kZXhlci5wbAlXZWQgTWFyIDI3IDE5OjMwOjEwIDIwMDIKKioqKioq
KioqKioqKioqCioqKiA2Miw2NyAqKioqCi0tLSA2Miw3MCAtLS0tCiAgCiAg
JHw9MTsKICAKKyAjIGlmIEZMVVNIX0ZSRVFVRU5DWSBpcyBub3Qgc2V0IGlu
IHRoZSBjb25mLnBsLCBnaXZlIGl0IGEgdmFsdWUgaGVyZQorICRGTFVTSF9G
UkVRVUVOQ1kgPSAxIHVubGVzcyAkRkxVU0hfRlJFUVVFTkNZOworIAogICMg
Q2FsbGluZyB2aWEgQ0dJIGlzIGFsbG93ZWQgd2l0aCBwYXNzd29yZDoKICBp
ZiggJElOREVYRVJfQ0dJX1BBU1NXT1JEICYmICRFTlZ7J1JFUVVFU1RfTUVU
SE9EJ30gKSB7CiAgICBwcmludCAiQ29udGVudC1UeXBlOiB0ZXh0L3BsYWlu
XG5cbiI7CioqKioqKioqKioqKioqKgoqKiogMTEyLDExNyAqKioqCi0tLSAx
MTUsMTIzIC0tLS0KICAjIFRoZSBmb2xsb3dpbmcgdHdvIGhhc2hlcyBhcmUg
dGVtcG9yYXJ5IGFuZCB3aWxsIG5vdCBiZSBzYXZlZCB0byBkaXNrOgogIG15
ICVkZl9kYjsgICAgICAgICAgIyB0ZXJtIGlkIC0+IG51bWJlciBvZiBvY2N1
cmVuY2VzIG9mIHRoaXMgdGVybSBpbiBhbGwgZG9jdW1lbnRzCiAgbXkgJXRm
X2RiOyAgICAgICAgICAjIHRlcm0gaWQgLT4gbGlzdCBvZiBwYWlyczogKGRv
Y3VtZW50IGlkLCBudW1iZXIgb2Ygb2NjdXJlbmNlcyBpbiB0aGlzIGRvY3Vt
ZW50KQorIG15ICV0Zl9kYl90bXA7CisgbXkgJWRmX2RiX3RtcDsKKyAKICAK
ICBpZiggJExPV19NRU1PUllfSU5ERVggKSB7CiAgICB0aWUgJWludl9pbmRl
eF9kYiwgJGRiX3BhY2thZ2UsICRJTlZfSU5ERVhfVE1QX0RCX0ZJTEUsIE9f
Q1JFQVR8T19SRFdSLCAwNzU1IG9yIGRpZSAiQ2Fubm90IG9wZW4gJElOVl9J
TkRFWF9UTVBfREJfRklMRTogJCEiOwoqKioqKioqKioqKioqKioKKioqIDE0
MywxNDggKioqKgotLS0gMTQ5LDE1NiAtLS0tCiAgCWluaXRfZmlsZXN5c3Rl
bSgpOwogIAljcmF3bF9maWxlc3lzdGVtKCRET0NVTUVOVF9ST09UKTsKICB9
CisgIyBuZWVkIHRvIGZsdXNoIHRoZSBmaW5hbCBwYWdlcyB0byBkaXNrCisg
Zmx1c2hfdG9fZGlzaygpIGlmICRMT1dfTUVNT1JZX0lOREVYOwogIHByaW50
ICJDcmF3bGVyIGZpbmlzaGVkOiBpbmRleGVkICRETiBmaWxlcywgIi4oJFRO
X25vbl91bmlxdWUrJFROKS4iIHRlcm1zICgkVE4gZGlmZmVyZW50IHRlcm1z
KSwgXG4iOwogIHByaW50ICJpZ25vcmVkICRub19pbmRleF9jb3VudCBmaWxl
cyBiZWNhdXNlIG9mIGNvbmYvbm9faW5kZXgudHh0XG5cbiI7CiAgCioqKioq
KioqKioqKioqKgoqKiogMjYzLDI3NSAqKioqCiAgICAgICAgKyskdGZ7JHRl
cm1faWR9OwogICAgICB9CiAgICB9CiEgICAKISAgIGZvcmVhY2ggKGtleXMg
JXRmKSB7CiEgICAgICRkZl9kYnskX30rKzsKICAgICAgJHRmX2RieyRffSA9
ICcnIHVubGVzcyBkZWZpbmVkICR0Zl9kYnskX307CiEgICAgICR0Zl9kYnsk
X30gLj0gcGFjaygid3ciLCAkZG9jX2lkLCAkdGZ7JF99KTsgCiAgICB9CiAg
fQogIAogICMgQ2FsY3VsYXRlIHRoZSB3ZWlnaHQgKHNjb3JlKSBmb3IgZWFj
aCB0ZXJtIGluIGVhY2ggZmlsZSBhbmQgCiAgIyBzYXZlIGl0IHRvIHRoZSBk
YXRhYmFzZS4KLS0tIDI3MSwzMDYgLS0tLQogICAgICAgICsrJHRmeyR0ZXJt
X2lkfTsKICAgICAgfQogICAgfQohICAgICBmb3JlYWNoIChrZXlzICV0Zikg
ewohICAgICBpZigkTE9XX01FTU9SWV9JTkRFWCkgewohICAgICAgICRkZl9k
Yl90bXB7JF99Kys7CiEgICAgICAgJHRmX2RiX3RtcHskX30gPSAnJyB1bmxl
c3MgZGVmaW5lZCAkdGZfZGJfdG1weyRffTsKISAgICAgICAkdGZfZGJfdG1w
eyRffSAuPSBwYWNrKCJ3dyIsICRkb2NfaWQsICR0ZnskX30pOyAKISAgICAg
fSBlbHNlIHsKISAgICAgICAkZGZfZGJ7JF99Kys7CiEgICAgICAgJHRmX2Ri
eyRffSA9ICcnIHVubGVzcyBkZWZpbmVkICR0Zl9kYl90bXB7JF99OwohICAg
ICAgICR0Zl9kYnskX30gLj0gcGFjaygid3ciLCAkZG9jX2lkLCAkdGZ7JF99
KTsgCiEgICAgIH0KISAgIH0KISAgIGlmKCRMT1dfTUVNT1JZX0lOREVYICYm
ICgoJGRvY19pZCAlICRGTFVTSF9GUkVRVUVOQ1kpID09IDApKSB7CiEgICAg
ICMgZmx1c2ggdGhlIGluLW1lbW9yeSB0Zl9kYiBhbmQgZGZfZGIgdG8gZGlz
awohICAgICBmbHVzaF90b19kaXNrKCk7CiEgICB9CiEgfQohIAohIHN1YiBm
bHVzaF90b19kaXNrIHsKISAgIHByaW50ICJGbHVzaGluZy4uLlxuIjsKISAg
IGZvcmVhY2ggKGtleXMgJXRmX2RiX3RtcCkgewohICAgICAkZGZfZGJ7JF99
ID0gMCB1bmxlc3MgZGVmaW5lZCAkZGZfZGJ7JF99OwohICAgICAkZGZfZGJ7
JF99ICs9ICRkZl9kYl90bXB7JF99OwogICAgICAkdGZfZGJ7JF99ID0gJycg
dW5sZXNzIGRlZmluZWQgJHRmX2RieyRffTsKISAgICAgJHRmX2RieyRffSAu
PSAkdGZfZGJfdG1weyRffTsgCiAgICB9CisgICAjIG5vdyBjbGVhciBvdXQg
dG1wIGRhdGFiYXNlcworICAgJXRmX2RiX3RtcCA9ICgpOworICAgJWRmX2Ri
X3RtcCA9ICgpOwogIH0KKyAKICAKICAjIENhbGN1bGF0ZSB0aGUgd2VpZ2h0
IChzY29yZSkgZm9yIGVhY2ggdGVybSBpbiBlYWNoIGZpbGUgYW5kIAogICMg
c2F2ZSBpdCB0byB0aGUgZGF0YWJhc2UuCg==

--0-1735424170-1017279009=:66125--