
4. Summarizer

4.1 Why doesn't Post-Summarizing work?

The most common error is that instructions are indented with spaces instead of a tab. The Post-Summarizing rule file uses a Makefile-like syntax: conditions begin in the first column, and instructions are indented by one tab. Check the rule file and make sure that every instruction line starts with a tab.
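A quick way to spot the problem is to scan the rule file for lines that start with a space. The following is only a minimal sketch in Python, not part of Harvest; the function name find_space_indented is made up for this example, and the rule file path is taken from the command line.

```python
import sys


def find_space_indented(lines):
    """Return (line_number, text) pairs for non-blank lines that start with a space."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Conditions start in the first column and instructions start with a tab,
        # so a non-blank line that starts with a space is suspicious.
        if line.startswith(" ") and line.strip():
            bad.append((number, line.rstrip("\n")))
    return bad


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical script name): python check_rules.py <rule file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, text in find_space_indented(handle):
            print(f"{sys.argv[1]}:{number}: starts with spaces, expected a tab: {text!r}")
```

Lines that consist only of whitespace are ignored; every other line reported by the script should be re-indented with a tab.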

4.2 How can I summarize meta tags in HTML documents?

In Harvest 1.5.20.kj-0.3, the default summarizer for HTML data was switched to HTML-lax.sum, which does not handle meta tags. Edit $HARVEST_HOME/lib/gatherer/HTML.sum and uncomment the SGML-based or Perl-based summarizer.

4.3 Why do raw HTML tags appear in some query results?

If you see raw HTML tags in query results, the HTML summarizer was not able to parse the page correctly. Harvest comes with three different summarizers for HTML. If the default summarizer fails, try the other two. To do this, edit $HARVEST_HOME/lib/gatherer/HTML.sum and uncomment one of the summarizers.

4.4 How can I summarize DVI files?

Use a Harvest version older than 1.5.20-kj-0.8 or newer than 1.7.2. The versions in between have a bug that prevents DVI files from being summarized.

4.5 How can I summarize PDF files?

You need xpdf to summarize PDF files. Harvest uses pdftotext, which is part of xpdf, to extract the text.

Alternatively, you can use acroread to convert PDF files to PostScript and pass the result to the PostScript summarizer. To do this, edit $HARVEST_HOME/lib/gatherer/Pdf.sum accordingly.

4.6 Where can I get pdftotext?

pdftotext is part of xpdf. It is available from the Xpdf homepage, http://www.foolabs.com/xpdf/.

4.7 How can I improve the summarizer for Microsoft Word files?

Harvest uses catdoc to summarize Microsoft Word files. If you get bad summaries for Microsoft Word files, you might want to try wvHtml, which is part of wvWare, instead of catdoc.

4.8 Where can I get wvWare?

wvWare is available from the wvWare homepage, http://www.wvware.com/.

4.9 How can I add support for a new file type?

Give the new file type a name, then teach Harvest to recognize it by modifying byname.cf (to determine the file type from the file name), byurl.cf (from the URL), or magic and bycontent.cf (from the content of the file). You will find bycontent.cf, byname.cf, byurl.cf and magic in your $HARVEST_HOME/lib/gatherer/ directory.

Create a summarizer (a program or script) that takes the file name as its first argument and prints a SOIF stream of the form "Attributename{length of data}:<tab>your data" to stdout. For file type "Xyz", you have to create a summarizer called Xyz.sum in the $HARVEST_HOME/lib/gatherer/ directory.

In most cases it is probably easiest to convert file type "Xyz" to a supported file type such as HTML or PostScript and run an existing summarizer on the converted file.
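As an illustration, a summarizer of this kind could look roughly like the Python sketch below. It is not taken from Harvest: it assumes that the hypothetical "Xyz" files are plain text, that the length in braces is counted in bytes, and that Title and Full-Text are suitable attribute names. The helper names soif_attribute and summarize are invented for this example.

```python
import sys


def soif_attribute(name, value):
    """Format one attribute as b'Name{length}:<tab>value' followed by a newline."""
    data = value.encode("utf-8")
    # Assumption: the length in braces is the size of the data in bytes.
    return f"{name}{{{len(data)}}}:\t".encode("ascii") + data + b"\n"


def summarize(path):
    """Build the attribute stream for one file, treating it as plain text."""
    with open(path, "rb") as handle:
        text = handle.read().decode("utf-8", errors="replace")
    lines = text.splitlines()
    # Illustrative choice: use the first line of the file as the title.
    title = lines[0].strip() if lines else ""
    return soif_attribute("Title", title) + soif_attribute("Full-Text", text)


if __name__ == "__main__" and len(sys.argv) == 2:
    # The file name is passed as the first argument.
    sys.stdout.buffer.write(summarize(sys.argv[1]))
```

Saved as an executable Xyz.sum with a suitable interpreter line added at the top, a script like this would be called with the file name as its only argument. Writing bytes to stdout keeps the stated length consistent with what is actually printed.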

4.10 How can I use nsgmls instead of sgmls to summarize documents?

Edit $HARVEST_HOME/lib/gatherer/SGML.sum and set $sgmls_cmd = "/usr/local/bin/nsgmls", or wherever you have installed nsgmls.

