Perlfect Solutions
 

[Perlfect-search] Indexing Microsoft Documents

Yuriy Yakubovich yuriy@mint-tech.com
Tue, 2 Oct 2001 13:23:11 -0400

I've done this using "to text" converters.

1.  Install "catdoc" and "xls2csv" from http://freshmeat.net.  Make sure they
work!

2.  Add the following to conf.pl:
==============================================================
# To convert DOC and EXCEL files we'll use /usr/local/bin/catdoc and
# /usr/local/bin/xls2csv
$DOCTOTEXT = '/usr/local/bin/catdoc';
$XLSTOTEXT = '/usr/local/bin/xls2csv';
==============================================================
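Before editing indexer.pl it may help to confirm that both paths really point at executables. A minimal sketch, assuming the /usr/local/bin locations used above; missing_converters is a hypothetical helper, not part of Perlfect Search:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the paths that are not executable regular files.
sub missing_converters {
  my @paths = @_;
  return grep { !(-f $_ && -x $_) } @paths;
}

# Paths as set in conf.pl above; adjust them to your installation.
my @missing = missing_converters('/usr/local/bin/catdoc',
                                 '/usr/local/bin/xls2csv');
print @missing ? "Not executable: @missing\n" : "Both converters found.\n";
```

This only checks that the files exist and are executable. Running each converter by hand on a sample document is still the real test.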

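3.  In the file indexer.pl, search for PDFTOTEXT and add each of the following
blocks underneath the matching PDF code. The two subs go next to the PDF
parsing sub; the filename checks in the last block go inside to_be_ignored(),
as the comments in the subs indicate: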
==============================================================
# Checks if a file is DOC depending on the filename. If so, write it to a
# temporary file, feed it to $DOCTOTEXT and replace the buffer (passed by
# reference) with the output. If it's not DOC, leave the buffer unmodified.
sub parse_doc {
  my $buffer = $_[0];
  my $url = $_[1];
  if ($url =~ m/\.doc$/i && $DOCTOTEXT) {
    my $tmpfile = "$TMP_DIR/temp.doc";
    # Saving to a temporary file is necessary for DOCs requested via http. To
    # keep things simple, we also do it for local files from disk.
    open(TMPFILE, ">$tmpfile")
      or (warn "Cannot write '$tmpfile': $!" and return undef);
    binmode(TMPFILE);
    print TMPFILE ${$buffer};
    close(TMPFILE);
    # filename security check is done in to_be_ignored():
    ${$buffer} = `$DOCTOTEXT "$tmpfile"`
      or (warn "Cannot execute '$DOCTOTEXT $tmpfile': $!" and return undef);
    unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
  }
}
==============================================================
# Checks if a file is XLS depending on the filename. If so, write it to a
# temporary file, feed it to $XLSTOTEXT and replace the buffer (passed by
# reference) with the output. If it's not XLS, leave the buffer unmodified.
sub parse_xls {
  my $buffer = $_[0];
  my $url = $_[1];
  if ($url =~ m/\.xls$/i && $XLSTOTEXT) {
    my $tmpfile = "$TMP_DIR/temp.xls";
    # Saving to a temporary file is necessary for XLSs requested via http. To
    # keep things simple, we also do it for local files from disk.
    open(TMPFILE, ">$tmpfile")
      or (warn "Cannot write '$tmpfile': $!" and return undef);
    binmode(TMPFILE);
    print TMPFILE ${$buffer};
    close(TMPFILE);
    # filename security check is done in to_be_ignored():
    ${$buffer} = `$XLSTOTEXT "$tmpfile"`
      or (warn "Cannot execute '$XLSTOTEXT $tmpfile': $!" and return undef);
    unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
  }
}
===============================================================
  # For DOC files check filename for security reasons (it later gets handed
  # to a shell!):
  if( $file =~ m/\.doc$/i && $DOCTOTEXT ) {
    if( $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ || $file =~ m/\.\./ ) {
      return "Ignoring '$file': illegal characters in filename";
    }
  }
  # For XLS files check filename for security reasons (it later gets handed
  # to a shell!):
  if( $file =~ m/\.xls$/i && $XLSTOTEXT ) {
    if( $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ || $file =~ m/\.\./ ) {
      return "Ignoring '$file': illegal characters in filename";
    }
  }
===============================================================
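The whitelist above exists because the converter is started through a shell. The sketch below shows the same check as a standalone helper, plus one possible way to avoid the shell altogether. Both filename_is_safe and convert_buffer are hypothetical names, not Perlfect Search functions, and the list-form pipe open assumes Perl 5.8 or later on a Unix-like system:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if $file passes the whitelist from step 3.
sub filename_is_safe {
  my ($file) = @_;
  return 0 if $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/;  # whitelisted characters only
  return 0 if $file =~ m/\.\./;                     # no '..' sequences
  return 1;
}

# Hypothetical shell-free variant of parse_doc/parse_xls. $buffer is a scalar
# reference, $converter a path such as $DOCTOTEXT, $suffix '.doc' or '.xls'.
sub convert_buffer {
  my ($buffer, $converter, $suffix) = @_;
  # tempfile() picks a unique name, so concurrent indexer runs cannot collide.
  my ($fh, $tmpfile) = tempfile(SUFFIX => $suffix, UNLINK => 1);
  binmode($fh);
  print {$fh} ${$buffer};
  close($fh) or return 0;
  # List-form pipe open runs the converter directly; no shell parses the name.
  open(my $pipe, '-|', $converter, $tmpfile) or return 0;
  binmode($pipe);
  my $text = do { local $/; <$pipe> };   # slurp the converter's output
  my $ok = close($pipe);                 # false if the converter failed
  unlink $tmpfile;
  return 0 unless $ok;
  ${$buffer} = defined $text ? $text : '';
  return 1;
}

foreach my $name ('docs/report.doc', 'docs/my report.doc', '../secret.xls') {
  print "$name: ", (filename_is_safe($name) ? 'indexed' : 'ignored'), "\n";
}
```

Note that the whitelist has no space character, so documents with spaces in their names are skipped by the indexer. A call such as convert_buffer(\$content, $DOCTOTEXT, '.doc') would stand in for parse_doc once the extension has been checked.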

4.  In the file indexer_filesystem.pl, search for parse_pdf and add the
following underneath:

    parse_doc(\$buffer, $file);
    parse_xls(\$buffer, $file);

5.  In the file indexer_web.pl, search for parse_pdf and add the following
underneath:

        parse_doc(\$content, $url);
        parse_xls(\$content, $url);

This works for me.

Good luck.

Yuriy


  -----Original Message-----
  From: perlfect-search-admin@perlfect.com
  [mailto:perlfect-search-admin@perlfect.com] On Behalf Of Davone Vang
  Sent: Tuesday, October 02, 2001 11:04 AM
  To: perlfect-search@perlfect.com
  Subject: [Perlfect-search] Indexing Microsoft Documents


  I would like to know if Perlfect or anyone has found a way to index and
search Microsoft Documents?  I have Perlfect 3.20 and I'm able to index PDF
files and would like to know if I can do the same for Doc files.  I would
really appreciate any feedback on this.  Thank you.

  Davone

