[Perlfect-search] Indexing Microsoft Documents
Yuriy Yakubovich yuriy@mint-tech.com
Tue, 2 Oct 2001 13:23:11 -0400
I've done this using "to text" converters.
1. Install "catdoc" and "xls2csv" from http://freshmeat.net. Make sure they
work!
2. Add the following to conf.pl:
==============================================================
# To convert DOC and EXCEL files we'll use /usr/local/bin/catdoc and
# /usr/local/bin/xls2csv
$DOCTOTEXT = '/usr/local/bin/catdoc';
$XLSTOTEXT = '/usr/local/bin/xls2csv';
==============================================================
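A quick way to confirm that the configured paths point at something runnable is sketched below. The helper name missing_converters is hypothetical (it is not part of Perlfect Search), and the check only tests that each path is an executable file, not that the converter produces sensible output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perlfect Search): given name => path
# pairs, returns the names (sorted) whose path is not an executable file.
sub missing_converters {
    my (%paths) = @_;
    my @missing = grep { !( -f $paths{$_} && -x _ ) } sort keys %paths;
    return @missing;
}

# These are the paths used in conf.pl above; adjust them to your system.
my @missing = missing_converters(
    catdoc  => '/usr/local/bin/catdoc',
    xls2csv => '/usr/local/bin/xls2csv',
);
print @missing
    ? "Not executable: @missing\n"
    : "Both converters look executable\n";
```

Running catdoc and xls2csv by hand on a sample file is still the real test.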
3. In the file indexer.pl, search for PDFTOTEXT and add the following
blocks in the corresponding places underneath (the two subs next to the PDF
one, the filename checks in to_be_ignored()):
==============================================================
# Checks whether a file is a DOC file, based on its filename. If so, writes
# the buffer to a temporary file, feeds it to $DOCTOTEXT, and replaces the
# buffer (passed by reference) with the output. Otherwise the buffer is left
# unmodified.
sub parse_doc {
  my $buffer = $_[0];
  my $url = $_[1];
  if ($url =~ m/\.doc$/i && $DOCTOTEXT) {
    my $tmpfile = "$TMP_DIR/temp.doc";
    # Saving to a temporary file is necessary for DOCs requested via HTTP. To
    # keep things simple, we also do it for local files from disk.
    open(TMPFILE, ">$tmpfile")
      or (warn "Cannot write '$tmpfile': $!" and return undef);
    binmode(TMPFILE);
    print TMPFILE ${$buffer};
    close(TMPFILE);
    # The filename security check is done in to_be_ignored():
    ${$buffer} = `$DOCTOTEXT "$tmpfile"`
      or warn "Cannot execute '$DOCTOTEXT $tmpfile': $!";
    # Remove the temporary file even if the conversion failed.
    unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
  }
}
==============================================================
# Checks whether a file is an XLS file, based on its filename. If so, writes
# the buffer to a temporary file, feeds it to $XLSTOTEXT, and replaces the
# buffer (passed by reference) with the output. Otherwise the buffer is left
# unmodified.
sub parse_xls {
  my $buffer = $_[0];
  my $url = $_[1];
  if ($url =~ m/\.xls$/i && $XLSTOTEXT) {
    my $tmpfile = "$TMP_DIR/temp.xls";
    # Saving to a temporary file is necessary for XLSs requested via HTTP. To
    # keep things simple, we also do it for local files from disk.
    open(TMPFILE, ">$tmpfile")
      or (warn "Cannot write '$tmpfile': $!" and return undef);
    binmode(TMPFILE);
    print TMPFILE ${$buffer};
    close(TMPFILE);
    # The filename security check is done in to_be_ignored():
    ${$buffer} = `$XLSTOTEXT "$tmpfile"`
      or warn "Cannot execute '$XLSTOTEXT $tmpfile': $!";
    # Remove the temporary file even if the conversion failed.
    unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
  }
}
===============================================================
# For DOC files, check the filename for security reasons (it later gets
# handed to a shell!):
if( $file =~ m/\.doc$/i && $DOCTOTEXT ) {
  if( $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ || $file =~ m/\.\./ ) {
    return "Ignoring '$file': illegal characters in filename";
  }
}
# For XLS files, check the filename for security reasons (it later gets
# handed to a shell!):
if( $file =~ m/\.xls$/i && $XLSTOTEXT ) {
  if( $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ || $file =~ m/\.\./ ) {
    return "Ignoring '$file': illegal characters in filename";
  }
}
===============================================================
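To see what the filename check accepts and rejects, here is a small standalone sketch. It uses the same two patterns as the blocks above; the function name is_safe_filename and the sample names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the name consists of the whitelisted
# characters and contains no ".." sequence (same patterns as above).
sub is_safe_filename {
    my ($file) = @_;
    return 0 if $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/;
    return 0 if $file =~ m/\.\./;
    return 1;
}

# Sample names, invented for this demonstration.
for my $name ('docs/report.doc', 'docs/my report.doc',
              '../secret.doc', 'a;b.doc') {
    printf "%-20s %s\n", $name, is_safe_filename($name) ? 'ok' : 'rejected';
}
```

Note that names containing spaces are rejected, so such files will simply not be indexed. Also, `$` in a Perl pattern matches before a trailing newline as well; anchoring with `\z` instead would be stricter.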
4. In the file indexer_filesystem.pl, search for parse_pdf and add the
following underneath:
parse_doc(\$buffer, $file);
parse_xls(\$buffer, $file);
5. In the file indexer_web.pl, search for parse_pdf and add the following
underneath:
parse_doc(\$content, $url);
parse_xls(\$content, $url);
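A possible variation on the two subs, for anyone worried about the fixed temp.doc / temp.xls names or about going through a shell: the sketch below uses the core File::Temp module for a unique temporary file and the list form of a pipe open, which passes the filename to the converter without a shell. The helper name convert_via_tempfile is hypothetical, this is not how Perlfect Search itself does it, and the demo uses "cat" as a stand-in converter on the assumption of a Unix-like system.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: writes the referenced buffer to a uniquely named
# temporary file, runs $converter on it without a shell, and returns the
# converter's output (or nothing on failure).
sub convert_via_tempfile {
    my ($buffer_ref, $converter, $suffix) = @_;

    my ($fh, $tmpfile) = tempfile(SUFFIX => $suffix, UNLINK => 1);
    binmode($fh);
    print {$fh} ${$buffer_ref};
    close($fh) or do { warn "Cannot write '$tmpfile': $!"; return };

    # List-form pipe open: the arguments are not interpreted by a shell.
    my $pipe;
    unless (open($pipe, '-|', $converter, $tmpfile)) {
        warn "Cannot execute '$converter': $!";
        unlink $tmpfile;
        return;
    }
    my $text = do { local $/; <$pipe> };   # read all output at once
    close($pipe);
    unlink $tmpfile;
    return $text;
}

# Demo with "cat" standing in for catdoc.
my $buffer = "some document bytes\n";
my $text = convert_via_tempfile(\$buffer, 'cat', '.doc');
print defined $text ? $text : "conversion failed\n";
```

Inside parse_doc the call would look like ${$buffer} = convert_via_tempfile($buffer, $DOCTOTEXT, '.doc'); and likewise with $XLSTOTEXT and '.xls' in parse_xls.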
This works for me.
Good luck.
Yuriy
-----Original Message-----
From: perlfect-search-admin@perlfect.com
[mailto:perlfect-search-admin@perlfect.com]On Behalf Of Davone Vang
Sent: Tuesday, October 02, 2001 11:04 AM
To: perlfect-search@perlfect.com
Subject: [Perlfect-search] Indexing Microsoft Documents
I would like to know if Perlfect or anyone has found a way to index and
search Microsoft Documents? I have Perlfect 3.20 and I'm able to index PDF
files and would like to know if I can do the same for Doc files. I would
really appreciate any feedback on this. Thank you.
Davone