<!doctype linuxdoc system>

<article>

<title>Harvest FAQ

<author>Kang-Jin Lee <tt/lee@arco.de/

<date>2003-11-08

<abstract>Harvest frequently asked questions (FAQ) with answers

<toc>

<sect>Harvest

<p>

<sect1>What is Harvest?

<p>
Harvest is a system that collects information and makes it searchable
through a web interface. Harvest can collect information from the
Internet and from intranets using http, ftp and nntp, as well as from
local files such as data on hard disks, CD-ROMs and file servers. The
supported formats, in addition to HTML, include TeX, DVI, PS, full
text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft
Word/Excel, SGML, C sources and many more. Stubs for PDF support are
included in Harvest and use Xpdf or Acroread to process PDF files.
Adding support for a new format is easy due to Harvest's modular design.

<sect1>Where can I get more information about Harvest?

<p>
See <htmlurl url="http://harvest.sourceforge.net/" name="Harvest
homepage http://harvest.sourceforge.net/"> for information about
Harvest.

<sect1>Where can I download Harvest?

<p>
Harvest is available for download at
<htmlurl url="http://prdownloads.sourceforge.net/harvest/"
name="Harvest download page
http://prdownloads.sourceforge.net/harvest/">.

<sect1>Is there any information about Harvest in Russian?

<p>
Andrei Malashevich has translated the Harvest User's Manual into Russian. It is
available at his <htmlurl url="http://baby.chg.ru/manual_harvest/"
name="Harvest User's Manual page at http://baby.chg.ru/manual_harvest/">.

<sect1>What is Harvest-ng?

<p>
Harvest-ng is a reimplementation of Harvest's gatherer by Simon
Wilkinson. You can get more info about Harvest-ng at <htmlurl
url="http://webharvest.sourceforge.net/ng/" name="Harvest-ng
homepage http://webharvest.sourceforge.net/ng/">.

<sect1>What is the copyright status of Harvest?

<p>
The core of Harvest, located in the <em/src/ directory, is under the
GPL. Additional components, located in the <em/components/ directory,
are under the GPL or similar licenses.

<sect1>Which Operating System do I need to run Harvest?

<p>
Harvest should run on any Unix-like platform, including FreeBSD, Linux
and Solaris.

<sect1>Does Harvest run under Windows NT/2000/XP?

<p>
Michael Schlenker has ported Harvest to Windows platforms using
<htmlurl url="http://sources.redhat.com/cygwin/" name="Cygwin
http://sources.redhat.com/cygwin/">.

<sect1>What Hardware do I need to use Harvest?

<p>
A Pentium 120 MHz with 64 MB RAM should achieve reasonable performance
for around 350 MB of fulltext data in about 20,000 objects. A Pentium
650 MHz with 256 MB RAM should be able to handle around 1.5 GB of
fulltext data in about 100,000 objects.

<sect1>Which version of Harvest should I use?

<p>
<itemize>
<item>If you want to help develop Harvest, use the most recent
      version of Harvest.
<item>If you are cautious, a version older than a week should
      be reasonably safe to use.
<item>If you don't want to use development versions of Harvest, use
      the last version marked as stable.
</itemize>

<sect1>What are "harvest-modified-by-RL-Stajsic", "harvest-MathNet", and "harvest-1.5.20-kj"?

<p>
After the original authors ceased working on Harvest, there were some
periods when Harvest was unmaintained. During this time, the following
forked versions of Harvest appeared:

<itemize>
<item>"harvest-modified-by-RL-Stajsic" was released by R.L. Stajsic
       and Tim Samshuijzen with some bugfixes.
<item>"harvest-MathNet" is a modified version of Harvest-1.5.20 that
      improves the handling of German special characters ("Umlaute",
      "scharfes S").
<item>The "harvest-1.5.20-kj" series was released by me with bugfixes
      to Harvest 1.5.20.
</itemize>

All these forked trees were merged into Harvest 1.6.

<sect1>What are the limits of Harvest?

<p>
<itemize>
<item>Harvest's Gatherer uses a GDBM database to store the summarized
      data. On some architectures and operating systems, the maximum
      file size is 2 GB, so you can't have a database larger than 2 GB
      per Gatherer on those systems. To collect more data, you have to
      set up multiple Gatherers.
<item>The Broker stores the data as single files. On most operating
      systems, performance degrades noticeably as the number of files
      in a directory increases. Since the Broker uses a finite number
      of directories, defined in <em>src/broker/stor_man.c</em>, to
      store the files, the Broker will slow down as the number of
      files increases.
</itemize>

<sect1>Do I need root access to install and run Harvest?

<p>
For initial setup, you must be able to modify the webserver
configuration and to schedule cron jobs. After the initial setup, it
is recommended to run Harvest as a different user for security
reasons.

<sect1>How do I block Harvest from my site? How do I identify Harvest?

<p>
Add lines like these to your robots.txt:

<tscreen><verb>
        User-agent: Harvest
        Disallow: /
</verb></tscreen>
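
To check that such a rule has the intended effect, the two lines can be
fed to the robots.txt parser from the Python standard library. This is
only a sketch: the host name is a placeholder, and it shows how a
compliant robot announcing itself as "Harvest" would read the rule, not
how Harvest itself evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from the answer above.
ROBOTS_TXT = """\
User-agent: Harvest
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # www.example.com is a placeholder host, not a real Harvest target.
    for agent in ("Harvest", "SomeOtherRobot"):
        print(agent, is_allowed(agent, "http://www.example.com/index.html"))
```

With these rules, only the robot named "Harvest" is shut out; robots
with other names are unaffected.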

<sect1>What can I do to help?

<p>
There are many ways to help, depending on your skills and the time you
want to contribute to improving Harvest:

<itemize>
<item>Use Harvest and let others know that you are using Harvest.
<item>Use Harvest and let me know why you are using Harvest.
<item>Submit ideas, feature requests and bug reports.
<item>Contribute localization.
<item>Contribute documentation.
<item>Contribute code.
</itemize>

<sect>Building Harvest

<p>

<sect1>How do I uninstall Harvest?

<p>
Harvest keeps all of its files in <em>/usr/local/harvest</em> or whichever
<bf/prefix/ you have assigned during <tt/configure/. To uninstall Harvest,
simply delete the Harvest directory.

<p>
If you ran the following when installing Harvest:

<tscreen><verb>
        # ./configure --prefix=/home/data/harvest
</verb></tscreen>

<p>
then this should uninstall Harvest:

<tscreen><verb>
        # rm -fr /home/data/harvest
</verb></tscreen>

You might also want to check the start scripts which start Harvest daemons
during system boot and remove <tt/cron/ jobs necessary for running Harvest.

<sect1>Where can I get bison and flex?

<p>
Bison and flex are available at <url url="ftp://ftp.gnu.org/"
name="GNU FTP Site"> and its mirrors.

<sect1>How can I install Harvest in "/my/directory/harvest" instead of "/usr/local/harvest"?

<p>
Run:

<tscreen><verb>
        # ./configure --prefix=/my/directory/harvest
        # make
        # make install
</verb></tscreen>

<sect1>How can I avoid "syntax error before `regoff_t'" error message when compiling Harvest?

<p>
On some systems, building Harvest may fail with the following message:

<tscreen><verb>
        Making all in util
        gcc  -I../include -I./../include -c buffer.c
        In file included from ../include/config.h:350,
                         from ../include/util.h:112,
                         from buffer.c:86:
        /usr/include/regex.h:46: syntax error before `regoff_t'
        /usr/include/regex.h:46: warning: data definition has no type or storage class
        /usr/include/regex.h:56: syntax error before `regoff_t'
        *** Error code 1
</verb></tscreen>

If you get this error, edit <em>src/common/include/autoconf.h</em>
and add "#define USE_GNU_REGEX 1" before typing <tt/make/ to build
Harvest.

<sect1>Where can I get more information for building Harvest on FreeBSD?

<p>
See <htmlurl url="http://www.freshports.org/www/harvest/"
name="FreshPorts Harvest page http://www.freshports.org/www/harvest/">
for more information about building Harvest on FreeBSD.

<sect>Gatherer

<p>

<sect1>Does the Gatherer support cookies?

<p>
No, Harvest's Gatherer doesn't support cookies.

<sect1>Why doesn't Local-Mapping work?

<p>
In Harvest 1.7.7, the default HTML enumerator was switched from
<tt/httpenum-depth/ to <tt/httpenum-breadth/. The breadth-first
enumerator had a bug in <bf/Local-Mapping/, which was fixed in Harvest
1.7.19. To make <bf/Local-Mapping/ work, use the depth-first enumerator
or update to Harvest 1.7.19 or later.

Local mapping will fail if the file is not readable by the gatherer
process, or the file is not a regular file, or the file has execute
bits set, or the filename contains characters that have to be escaped
(like tilde, space, curly brace, quote, etc.). So, for directories,
symbolic links and CGI scripts, the gatherer will always contact the
server instead of using the local file.
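
The conditions above can be written down as a small check, which may
help when debugging why a particular file is fetched from the server.
This Python sketch is not Harvest's own code. The set of characters is
an assumption (the list above ends with "etc."), and "filename" is
assumed to mean the last path component.

```python
import os
import stat

# Characters assumed to need escaping; illustrative, not complete.
UNSAFE_CHARS = set("~ {}'\"")


def can_map_locally(path):
    """Apply the four conditions from the paragraph above to path."""
    try:
        mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
    except OSError:
        return False
    if not stat.S_ISREG(mode):  # directories, symbolic links, devices
        return False
    if mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return False  # an execute bit is set, e.g. a CGI script
    if any(ch in UNSAFE_CHARS for ch in os.path.basename(path)):
        return False  # name contains a character that must be escaped
    return os.access(path, os.R_OK)  # readable by the current process
```

Run the check as the user that owns the gatherer process, since the
readability test depends on who is asking.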

<sect1>Does the Gatherer gather the Root- and LeafNode-URLs periodically?

<p>
No, the Gatherer gathers Root- and LeafNode URLs only once. To check
the URLs periodically, you have to use cron (see "man 8 cron") to run
<tt>$HARVEST_HOME/gatherers/YOUR_GATHERER/RunGatherer</tt>.

<sect1>Can Harvest gather https URLs?

<p>
No, https is not supported by Harvest. To gather https URLs, use
Harvest-ng from Simon Wilkinson. It is available at <htmlurl
url="http://webharvest.sourceforge.net/ng/" name="Harvest-ng
homepage http://webharvest.sourceforge.net/ng/">.

<sect1>When will Harvest be able to gather https URLs?

<p>
This is not on top of my to-do list and may take some time.

<sect1>Does Harvest support client based scripting/plugin like Javascript, Flash?

<p>
No, Harvest's gatherer does not support Javascript, Flash, etc., and
there are no plans to add support for them.

<sect1>Why does the gatherer stop after gathering a few pages?

<p>
Harvest's gatherer doesn't support Javascript, Flash, etc.
Check the site you want to gather and make sure that the site
is browsable without any plugins, Javascript, etc.

<sect1>How can I index local newsgroups? How can I put hostname into News URL?

<p>
You will find a News URL hostname patch by Collin Smith in the
<em/contrib/ directory.

NOTE: Even though most web browsers support this, this violates
RFC-1738.

<sect1>What do the gatherer options "Search=Breadth" and "Search=Depth" do and which keywords are available for "Search=" option?

<p>
The Search option selects an enumerator for http and gopher URLs.
Harvest comes with a breadth-first (Search=Breadth) and a depth-first
(Search=Depth) enumerator for http and gopher. They use different
strategies when following URLs to build the list of candidates for
processing. The breadth-first enumerator processes all links on a
level before descending to the next level. If you limit the number of
URLs to gather from a site, it will give you a more representative
overview of the site. The depth-first enumerator descends to the next
level as soon as possible. When there are no links left in the current
branch, it processes the next branch. The depth-first enumerator
doesn't use as much memory as the breadth-first enumerator. If you
don't have a compelling reason to switch from one enumerator to the
other, the default value should be a reasonable choice.
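
The difference in visiting order can be illustrated with a toy site.
The following Python sketch is not the code of the Harvest
enumerators; the site layout, the function name and the
<tt/limit/ parameter are made up for illustration.

```python
from collections import deque

# A tiny hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [], "/a/2": [], "/b/1": [],
}


def enumerate_urls(links, root, search="Breadth", limit=None):
    """Return pages in the order a breadth or depth first walk visits them."""
    pending = deque([root])
    seen = {root}
    order = []
    while pending and (limit is None or len(order) < limit):
        # Breadth first takes the oldest pending URL, depth first the newest.
        url = pending.popleft() if search == "Breadth" else pending.pop()
        order.append(url)
        children = links.get(url, [])
        if search != "Breadth":
            children = reversed(children)  # keep the on-page link order
        for child in children:
            if child not in seen:
                seen.add(child)
                pending.append(child)
    return order
```

With a limit of three URLs, the breadth-first walk returns the start
page and both top level pages, while the depth-first walk returns the
start page, "/a" and "/a/1". This is why the breadth-first order gives
a more representative sample when the number of URLs is limited.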

<sect1>How can I index html pages generated by cgi scripts? How can I index URLs which have a "?" (question mark) in them?

<p>
Remove <em/HTTP-Query/ from
<em>$HARVEST_HOME/lib/gatherer/stoplist.cf</em> and
<em>$HARVEST_HOME/gatherers/YOUR_GATHERER/lib/stoplist.cf</em>. For
versions earlier than 1.7.5, you also have to create a (symbolic) link
from <tt>$HARVEST_HOME/lib/gatherer/HTML.sum</tt> to
<tt>$HARVEST_HOME/lib/gatherer/HTTP-Query.sum</tt>. To do this, type:

<tscreen><verb>
        # cd $HARVEST_HOME/lib/gatherer
        # ln -s HTML.sum HTTP-Query.sum
</verb></tscreen>

<sect1>Why is the gatherer so slow? How can I make it faster?

<p>
The gatherer's default setting is to sleep one second after retrieving
a URL. This is to avoid overloading the webserver. If you gather
from webservers under your control and know that they can handle the
additional load caused by the gatherer, add "Delay=0" to your root node
specification to disable the sleep.

The lines should look like:

<tscreen><verb>
        &lt;RootNodes&gt;
        http://www.SOMESERVER.com/ Search=Breadth Delay=0
        &lt;/RootNodes&gt;
</verb></tscreen>

Alternatively, you can set the delay value for all root nodes by
adding <bf/Access-Delay: 0/ to your configuration file.

It should look like:

<tscreen><verb>
        Gatherer-Name:  YOUR Gatherer
        Gatherer-Port:  8500
        Top-Directory:  /HARVEST_DIR/work1/gatherers/testgather
        Access-Delay:   0

        &lt;RootNodes&gt;
        http://www.MYSITE.com/ Search=Breadth
        &lt;/RootNodes&gt;
</verb></tscreen>

<sect1>Why is the gatherer still so slow?

<p>
Harvest's gatherer is designed to handle many types of documents and
many types of protocols. To achieve this flexibility it uses external
programs to handle the different types of documents and protocols. For
example, when gathering HTML documents via HTTP, the document is
parsed twice: first to get a list of candidates to gather and then to
get a summary of the document. The summarizer is started each time a
document arrives, quits after summarizing that document and has to be
restarted for the next document. Compared to more HTTP/HTML oriented
approaches, this causes significant overhead when gathering
HTTP/HTML only.

Harvest retrieves one document at a time, which causes a slowdown if
you encounter a slow site. Due to its implementation, the gathering
process is quite heavyweight and uses up to 25 MB of RAM per Gatherer.
For this reason, there were no attempts to spawn more gatherers to
optimize the bandwidth usage.

<sect1>How do I request "304 Not Modified" answers from HTTP servers?

<p>
To send "If-Modified-Since" headers and get "304 Not Modified" answers
from HTTP servers, add the following line to the gatherer's
configuration file:

<tscreen><verb>
        HTTP-If-Modified-Since: Yes
</verb></tscreen>

If the document hasn't changed since last gathering, the gatherer will
use the data from its database, instead of retrieving it again. This
will save bandwidth and speed up gathering significantly.
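
For readers unfamiliar with conditional requests, the following Python
sketch shows what such a request looks like on the HTTP level. It is
not how the gatherer is implemented; the function name and the URL are
placeholders.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_gathered):
    """Build a GET request that asks for url only if it has changed.

    last_gathered is a Unix timestamp (seconds since the epoch).
    """
    stamp = formatdate(last_gathered, usegmt=True)  # HTTP date in GMT
    return Request(url, headers={"If-Modified-Since": stamp})


if __name__ == "__main__":
    # www.example.com is a placeholder; nothing is sent over the network.
    req = conditional_request("http://www.example.com/", 0)
    # urllib stores header names with only the first letter capitalized.
    print(req.get_header("If-modified-since"))
```

If such a request is sent with <tt>urllib.request.urlopen</tt> and the
document is unchanged, the server answers "304 Not Modified" without a
body, which urllib reports as an <tt/HTTPError/ with code 304.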

<sect1>Why does Harvest gather different URLs between gatherings?

<p>
When <bf/HTTP-If-Modified-Since/ is enabled, the candidate selection
scheme of the http enumerators will change for successful database
lookups. For unchanged URLs, the enumerators will behave more like the
depth-first enumerator. The result of the gatherings should be the same
if you are gathering all URLs of a site, but if you gather only parts
of a site by using <bf/URL=n/ with <bf>n &lt; number of URLs of a
site</bf>, you will get a different subset of the site you gather.

<sect1>Why has the Gatherer's database vanished after gathering?

<p>
The Gatherer uses GDBM databases to store its data on disk. Database
files for Gatherer can grow very large depending on how much data you
gather. On some systems (e.g. i386-based Linux), the maximum file size
is 2 GB. If the amount of data surpasses this limit, the GDBM database
file will be wiped from the disk.

<sect1>How can I avoid GDBM files growing very big during Gathering?

<p>
The Gatherer's temporary GDBM database file <em/WORKING.gdbm/ will
grow very rapidly when gathering nested objects such as tar, tar.gz,
zip and other archives. GDBM databases keep growing when tuples are
inserted and deleted, because GDBM reuses only fractions of the empty
file space. To get rid of unused space, the GDBM database has to be
reorganized. Reorganization, however, is slow and will slow down the
gathering, so the default is not to reorganize the gatherer's
temporary database. This should work well for small to medium sized
Gatherers, but for large Gatherers it may be necessary to reorganize
the temporary database during gathering to keep the size of the
database at a manageable level. To reorganize <em/WORKING.gdbm/
every 100 deletions, add the following line to your gatherer
configuration file:

<tscreen><verb>
        Essence-Options: --max-deletions 100
</verb></tscreen>

Don't set this value too low, since reorganizing will consume a
significant share of CPU time and disk I/O. Reorganizing every 10 to
100 deletions seems to be a reasonable value.

<sect1>Can I use Htdig as Gatherer? Can the Broker import data from Htdig?

<p>
The Perl module <em/Metadata/ from Dave Beckett can dump data from
an Htdig database into a SOIF stream. Metadata only supports GDBM
databases, so this only works with versions earlier than Htdig 3.1,
because newer versions of Htdig switched from GDBM to Sleepycat's
Berkeley DB.

<sect1>How can I control access to Gatherer's database?

<p>
Edit <em>$HARVEST_HOME/gatherers/YOUR_GATHERER/data/gatherd.cf</em> to
allow or deny access. A line that begins with <bf>Allow</bf> is
followed by any number of domain or host names that are allowed to
connect to the Gatherer. If the word <bf>all</bf> is used, then all
hosts are matched. <bf>Deny</bf> is the opposite of
<bf>Allow</bf>. The following example will only allow hosts in the
<bf>cs.colorado.edu</bf> or <bf>usc.edu</bf> domains to access the
Gatherer's database:

<tscreen><verb>
        Allow  cs.colorado.edu usc.edu
        Deny   all
</verb></tscreen>
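
One way to read the example is that the lines are checked from top to
bottom and the first line that matches the connecting host decides.
The Python sketch below models that reading. The first-match order,
the suffix matching of domain names and all function names are
assumptions made for illustration, not taken from the Gatherer's
source.

```python
# The example gatherd.cf from above.
GATHERD_CF = """\
Allow  cs.colorado.edu usc.edu
Deny   all
"""


def parse_rules(text):
    """Turn gatherd.cf style lines into (action, [names]) pairs."""
    rules = []
    for line in text.splitlines():
        words = line.split()
        if words and words[0] in ("Allow", "Deny"):
            rules.append((words[0], words[1:]))
    return rules


def matches(host, name):
    """True if name is 'all', the host itself, or a parent domain of it."""
    host, name = host.lower(), name.lower()
    return name == "all" or host == name or host.endswith("." + name)


def decide(host, rules):
    """Return 'Allow' or 'Deny' from the first matching rule, else None."""
    for action, names in rules:
        if any(matches(host, name) for name in names):
            return action
    return None  # no rule matched; the default is not modelled here
```

Under these assumptions, a host such as "www.usc.edu" matches the
first line and is allowed, while "www.example.com" falls through to
the second line and is denied.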

<sect1>Does Harvest's Gatherer support WAP/WML, Gnutella, Napster?

<p>
No. Harvest's Gatherer doesn't support WAP. Peer to peer services like
Gnutella, Napster, etc. are also unsupported.

<sect1>How do I gather ftp URLs from wu-ftp daemons?

<p>
Changes in wu-ftpd 2.6.x broke <tt/ftpget/. There is a replacement for
it in the <em/contrib/ directory which wraps any ftp client to behave
like <tt/ftpget/.

<sect1>Why don't file URLs in LeafNodes work as expected?

<p>
File URLs pointing to directories like <em>file://misc/documents/</em>
in LeafNodes are treated as nested objects, which will be unnested.

<sect1>Why does gathering from a site fail completely or for parts of the site?

<p>
This may be caused by the site's <em/robots.txt/. You can check this
by typing "http://www.SOME.SITE.com/robots.txt" into your favourite
web browser.

<sect>Summarizer

<p>

<sect1>Why doesn't Post-Summarizing work?

<p>
The most common error is that the instructions are indented by spaces
instead of a tab-stop. Check the <bf/Post-Summarizing/ rule file and
make sure that instructions are indented by a tab-stop. The
<bf/Post-Summarizing/ rule file uses a syntax like that of a <em/Makefile/.
Conditions begin in the first column and instructions are indented by
a tab-stop.
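
Since tabs and spaces look alike in most editors, a small script can
help to find the offending lines. This Python sketch simply reports
every line that starts with a space. The function name is made up, and
the sample text in the usage note is a placeholder, not real rule
syntax.

```python
def space_indented_lines(rule_text):
    """Return the 1-based numbers of all lines that start with a space."""
    return [
        number
        for number, line in enumerate(rule_text.splitlines(), start=1)
        if line.startswith(" ")
    ]
```

For a text whose third line is indented with four spaces, while the
second line is indented with a tab, the function returns a list
containing only the number 3.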

<sect1>How can I summarize meta tags in HTML documents?

<p>
In Harvest 1.5.20.kj-0.3, the default summarizer for HTML data was
switched to <tt/HTML-lax.sum/ which does not handle meta tags. Edit
<tt>$HARVEST_HOME/lib/gatherer/HTML.sum</tt> and uncomment the SGML or
Perl based summarizer.

<sect1>Why are there raw HTML tags in some query results?

<p>
If you see raw HTML tags in query results, the HTML summarizer was not
able to parse the page correctly. Harvest comes with three different
summarizers for HTML. If the default summarizer fails, try the other
two summarizers. To do this, edit
<tt>$HARVEST_HOME/lib/gatherer/HTML.sum</tt> and uncomment one of the
summarizers.

<sect1>How can I summarize DVI files?

<p>
Use a Harvest version older than 1.5.20-kj-0.8 or newer than 1.7.2.
The versions in between have a bug which prevents DVI files from
being summarized.

<sect1>How can I summarize PDF files?

<p>
Harvest uses <tt/pdftotext/ from <em/xpdf/ to summarize PDF files, so
you need to have <em/xpdf/ installed.

Alternatively, you can use <tt/acroread/ to convert PDF files to
PostScript and pass them to the PostScript summarizer. To do this, edit
<tt>$HARVEST_HOME/lib/gatherer/Pdf.sum</tt> accordingly.

<sect1>Where can I get pdftotext?

<p>
<tt/pdftotext/ is part of <em/xpdf/. It is available at <htmlurl
url="http://www.foolabs.com/xpdf/" name="Xpdf homepage
http://www.foolabs.com/xpdf/">.

<sect1>How can I improve the summarizer for Microsoft Word files?

<p>
Harvest uses <em/catdoc/ to summarize Microsoft Word files. If you get
bad summaries for Microsoft Word files, you might want to try
<tt/wvHtml/, which is part of <em/wvWare/, instead of <em/catdoc/.

<sect1>Where can I get wvWare?

<p>
<em/wvWare/ is available at <htmlurl url="http://www.wvware.com/"
name="wvWare homepage http://www.wvware.com/">.

<sect1>How can I add support for a new file type?

<p>
Give the new file type a name and tell Harvest how to recognize it by
modifying <em/byname.cf/ (to determine the file type by its name),
<em/byurl.cf/ (to determine the file type by the URL), or <em/magic/
and <em/bycontent.cf/ (to determine the file type by looking at the
content of the file). You will find <em/bycontent.cf/, <em/byname.cf/,
<em/byurl.cf/ and <em/magic/ in your
<em>$HARVEST_HOME/lib/gatherer/</em> directory.

Create a summarizer (a program or script) which takes the filename as
its first argument and prints a SOIF stream "Attributename{length of
data}:<tt>&lt;tab&gt;</tt>your data" to stdout. For file type "Xyz",
you have to create a summarizer called <tt/Xyz.sum/ in the
<em>$HARVEST_HOME/lib/gatherer/</em> directory.

In most cases it is probably easiest to convert file type "Xyz" to
a supported file type like HTML, PostScript, etc. and use an existing
summarizer on the converted file.
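
The attribute format described above can be produced with a few lines
of code. The following Python sketch of a summarizer for the made-up
file type "Xyz" rests on several assumptions: the length is taken to
be the size of the data in bytes, the attribute names "Title" and
"Full-Text" are chosen for illustration, and the input is treated as
plain text.

```python
import sys


def soif_attribute(name, value):
    """Format one attribute as NAME{LENGTH}:<tab>VALUE.

    LENGTH is assumed to be the size of the value in bytes (UTF-8 here);
    the text above only says "length of data".
    """
    size = len(value.encode("utf-8"))
    return "%s{%d}:\t%s\n" % (name, size, value)


def summarize(path):
    """Hypothetical Xyz.sum: emit the first line and the whole text."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    lines = text.splitlines()
    title = lines[0] if lines else ""
    return soif_attribute("Title", title) + soif_attribute("Full-Text", text)


# The filename is expected as the first command line argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.write(summarize(sys.argv[1]))
```

Saved as <tt/Xyz.sum/ and made executable, a script of this kind would
be called with the name of the file to summarize as its only argument.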

<sect1>How can I use nsgmls instead of sgmls to summarize documents?

<p>
Edit <tt>$HARVEST_HOME/lib/gatherer/SGML.sum</tt> and set
<bf>$sgmls_cmd = "/usr/local/bin/nsgmls"</bf> or wherever you have
installed nsgmls.

<sect>Broker

<p>

<sect1>How can I start a Broker at boot time?

<p>
Some user-contributed startup scripts are located in the
<em>contrib/etc/</em> directory of the Harvest source distribution.
Modify the appropriate files and copy them to your startup script
directory.

<sect1>How can I start a Broker without starting a collection?

<p>
When a Broker starts, it starts collecting data, which can take some
time. To avoid this, use the <bf/-nocol/ option when invoking
<tt/RunBroker/.

If you have installed Harvest in <em>/usr/local/harvest/</em>, put
the following line into your startup file, e.g. <tt>/etc/rc.local</tt>:

<tscreen><verb>
        /usr/local/harvest/brokers/YOUR_BROKER/RunBroker -nocol
</verb></tscreen>

Replace <em>/usr/local/harvest/</em> with the directory where you have
installed Harvest.

<sect1>Why don't the documents which I have just gathered show up in the Broker?

<p>
The Broker imports data from the Gatherer once every 24 hours. If
you want to import the data immediately after gathering, just restart
the Broker or signal the Broker to import data.

You can signal the broker with the command line client <tt/brkclient/,
located in <em>$HARVEST_HOME/lib/broker/</em> by typing:

<tscreen><verb>
        # brkclient localhost 8501 '#ADMIN #Password secret #collection'
</verb></tscreen>

Replace hostname, port and password if necessary.

Another, easier method is to use the WWW based admin interface at:
"http://www.YOUR_SERVER.com/Harvest/brokers/YOUR_BROKER/admin/admin.html".

<sect1>Why do I get error messages when I try to access "http://some.host/Harvest/brokers/your-broker-path/" after running $HARVEST_HOME/RunHarvest?

<p>
Check the error log of your http daemon. The http daemon must be able
to follow symbolic links. For Apache httpd you can do this by adding:

<tscreen><verb>
        &lt;Location /Harvest/brokers/your-broker-path/&gt;
                Options FollowSymLinks
        &lt;/Location&gt;
</verb></tscreen>

to your <em/httpd.conf/.

If you don't want symbolic links, delete the symbolic link and copy
the file to the new name.

<sect1>Why are NEWS URLs broken? Where are the hostnames in NEWS URLs? How can I follow NEWS URLs?

<p>
Harvest's Gatherer doesn't put hostnames into NEWS URLs. If your web
browser complains about a missing news server, configure your web
browser to use the news server of your provider, company or
organization as your default news server.

For more information on why Harvest doesn't put hostnames into NEWS
URLs, see RFC 1738, sections 3.6 and 3.7.

<sect1>Why don't I get any results if I use a long or complex query string?

<p>
The length of a query string is limited to 30 characters when using
regular expressions (wildcards), excluding the escape characters.

<sect1>Can I use wildcards in attribute values for structured queries?

<p>
No, regular expressions for attribute names and attribute values in
structured queries aren't supported. So, queries like "Author: Smi.*"
or "Auth.*: Smith" won't do what you might expect.

<sect1>Are the attribute names case sensitive?

<p>
No, the attribute names are not case sensitive. So, "Time-To-Live" is the
same as "Time-to-Live", "Time-to-live", "time-to-live", etc.

<sect1>Why doesn't collecting from broker work?

<p>
This is due to a bug introduced in Harvest 1.5.18. The bug was
fixed in 1.7.8. To make it work again, update to 1.7.8 or higher.

<sect1>How can I customize the Harvest user interface?

<p>
The query pages are located in
<em>$HARVEST_HOME/brokers/YOUR_BROKER/query-*</em>.
Most likely, you don't want to make all the variables visible to users
who want to query your broker. Edit <em/query-*/ and use the
<bf/hidden/ type to set suitable defaults for variables you want to
hide.

The result set presentation can be customized by choosing or modifying
the configuration files located in the
<em>$HARVEST_HOME/cgi-bin/lib/</em> directory. The configuration files
<em>Sample.cf, classic.cf, modern.cf</em> and some <em>LANGUAGE.cf</em>
are already installed there. You can either create a new configuration
file or modify one of the existing configuration files to get the
result set presentation you want. See the Harvest User's Manual for
information about available options for the configuration file.

If you want to customize the result presentation even further, then
edit <tt>$HARVEST_HOME/cgi-bin/search.cgi</tt>.

<sect1>How do I localize/translate the user interface?

<p>
To localize the user interface, do:

<enum>
<item>Create
      <em>src/broker/example/brokers/skeleton/query-glimpse-modern.html.xx.in</em>,
      where <em>xx</em> is a two letter abbreviation for your
      language/country, by translating either
      <em>query-glimpse-modern.html.in</em> or another
      <em>query-glimpse-modern.html.yy.in</em>. This is the localized
      query page.
<item>Create <em>components/broker/standard/WWW/language.cf</em> by
      translating <em>modern.cf</em> or another translated configuration
      file like <em>spanish.cf</em>, <em>german.cf</em>, etc. This
      will localize the result pages and error messages.
<item>Create
      <em>src/broker/example/brokers/skeleton/query-glimpse.html.xx.in</em>
      by translating <em>query-glimpse.html.in</em> or
      <em>query-glimpse.html.yy.in</em>. This is the advanced query
      page.
<item>Translate <em>src/broker/example/brokers/*.html</em> to get
      localized additional help pages.
</enum>

<sect1>How can I replace the bundled Glimpse with another version of Glimpse?

<p>
Edit <em>$HARVEST_HOME/brokers/YOUR_BROKER/admin/broker.conf</em> to
let Harvest know the location of your <tt/glimpse/, <tt/glimpseindex/,
and <tt/glimpseserver/.

<sect>Terms

<p>

<sect1>What is a Gatherer?

<p>
A Gatherer is a system that retrieves documents from various sources
(web, news and FTP servers, local files) for processing. In an
HTML/HTTP context, it is also often called a <em/crawler/, <em/robot/,
or <em/spider/.

<sect1>What is Local-Mapping?

<p>
To reduce the CPU load and speed up gathering, Harvest can map local
files to URLs. The gatherer can bypass the server and use the local
file, while pretending to the rest of the Harvest system that the
objects were gathered as usual.

<sect1>What is a Summarizer?

<p>
A Summarizer transforms a document into a form which is more suitable
for fulltext searching.

The HTML summarizer, for example, extracts the title of a document,
removes all HTML tags, generates a wordlist, etc.

<sect1>What is a Broker?

<p>
A Broker processes search requests received from a user via a
CGI script and presents the search results.

<sect>Miscellaneous

<p>

<sect1>Who are the maintainers of Harvest?

<p>
Kang-Jin Lee <tt/lee@arco.de/ and Harald Weinreich
<tt/harald@weinreichs.de/ are maintaining Harvest.

<sect1>I have found a bug. What should I do?

<p>
Post a bug report to the newsgroup comp.infosystems.harvest or mail
it to Kang-Jin Lee <tt/lee@arco.de/ and Harald Weinreich
<tt/harald@weinreichs.de/.

<sect1>Is there a mailing list for Harvest? What about a newsgroup?

<p>
There is a <htmlurl
url="http://lists.sourceforge.net/lists/listinfo/harvest-devel/"
name="Harvest developer's mailinglist
http://lists.sourceforge.net/lists/listinfo/harvest-devel/"> for
Harvest users and developers. There is also a <url
url="news:comp.infosystems.harvest" name="Harvest newsgroup
news:comp.infosystems.harvest">.

</article>
