<!doctype linuxdoc system>

<article>

<title>Harvest User's Manual

<author>
Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee

<date>2002-10-29

<abstract>
The Harvest User's Manual was edited by Kang-Jin Lee and covers Harvest
version 1.8.  It was originally written by Darren R. Hardy, Michael
F. Schwartz and Duane Wessels for Harvest 1.4.pl2 on 1996-01-31.

<toc>

<sect>Introduction to Harvest

<p>
Harvest is an integrated set of tools to gather, extract, organize,
and search information across the Internet.  With modest effort, users
can tailor Harvest to digest information in many different formats and
offer custom search services on the Internet.

A key goal of Harvest is to provide a flexible system that can be
configured in various ways to create many types of indexes.

Harvest also allows users to extract structured (attribute-value pair)
information from many different information formats and build indexes
that allow these attributes to be referenced during queries (e.g.,
searching for all documents with a certain regular expression in the
title field).

An important advantage of Harvest is that it allows users to build indexes
using manually constructed templates (for maximum control over index
content), templates constructed automatically from extracted data (for easy
coverage of large collections), or a hybrid of the two methods.

Harvest is designed to make it easy to distribute the search system
on a pool of networked machines to handle higher load.

<sect1>Copyright

<p>
The core of Harvest is licensed under the <url url="../../COPYING"
name="GPL">.  Additional components distributed with Harvest are also
under the GPL or a similar license.  Glimpse, the current default fulltext
indexer, has a different license.  Here is a clarification of <url
url="../glimpse-license-status" name="Glimpse's copyright status">
kindly posted by <url url="mailto:gvelez@tucson.com" name="Golda
Velez"> to <url url="news:comp.infosystems.harvest"
name="comp.infosystems.harvest">.

<sect1>Online Harvest Resources

<p>
This manual is available at
<htmlurl
url="http://harvest.sourceforge.net/harvest/doc/html/manual.html"
name="harvest.sourceforge.net/harvest/doc/html/manual.html">.

More information about Harvest is available at
<htmlurl url="http://harvest.sourceforge.net/"
name="harvest.sourceforge.net">.

<sect>Subsystem Overview

<p>
Harvest consists of several subsystems.  The <em>Gatherer</em> subsystem
collects indexing information (such as keywords, author names, and titles)
from the resources available at <em>Provider</em> sites (such as FTP and
HTTP servers).  The <em>Broker</em> subsystem retrieves indexing information
from one or more Gatherers, suppresses duplicate information, incrementally
indexes the collected information, and provides a WWW query interface to it.

<label id="img1">
<figure loc="tbp">
<eps file="../images/img1.eps" height="10cm">
<img src="../images/img1.png">
<caption>Harvest Software Components</caption>
</figure>

You should start using Harvest simply, by installing a single ``stock''
(i.e., not customized) Gatherer and Broker on one machine to index some of
the FTP, World Wide Web, and NetNews data at your site.

After you get the system working in this basic configuration, you can
invest additional effort as warranted.  First, as you scale up to index
larger volumes of information, you can reduce the CPU and network load
of indexing your data by distributing the gathering process.  Second, you
can customize how Harvest extracts, indexes, and searches your
information, to better match the types of data you have and the ways
your users would like to interact with the data.

We discuss how to distribute the gathering process in the next subsection.
We cover various forms of customization in Section
<ref id="Customizing the type recognition, candidate selection,
presentation unnesting, and summarizing steps"
name="Customizing the type recognition, candidate selection,
presentation unnesting, and summarizing steps"> and in several parts
of Section <ref id="The Broker" name="The Broker">.

<sect1>Distributing the Gathering and Brokering Processes

<p>
Harvest Gatherers and Brokers can be configured in various ways.  Running
a Gatherer remotely from a Provider site allows Harvest to interoperate
with sites that are not running Harvest Gatherers, by using standard object
retrieval protocols like FTP, Gopher, HTTP, and NNTP.  However, as suggested
by the bold lines on the left side of Figure <ref id="img2" name="2">,
this arrangement results in excess server and network load.  Running a
Gatherer locally is much more efficient, as shown on the right side of Figure
<ref id="img2" name="2">.  Nonetheless, running a Gatherer remotely
is still better than having many sites independently collect indexing
information, since many Brokers or other search services can share the
indexing information that the Gatherer collects.

If you have a number of FTP/HTTP/Gopher/NNTP servers at your site, it is
most efficient to run a Gatherer on each machine where these servers
run.  On the other hand, you can reduce installation effort by running a
Gatherer at just one machine at your site and letting it retrieve data
from across the network.

<label id="img2">
<figure loc="tbp">
<eps file="../images/img2.eps" height="10cm">
<img src="../images/img2.png">
<caption>Harvest Configuration Options</caption>
</figure>

Figure <ref id="img2" name="2"> also illustrates that a Broker can
collect information from many Gatherers (to build an index of widely
distributed information).  Brokers can also retrieve information from other
Brokers, in effect cascading indexed views from one another.  Brokers
retrieve this information using the query interface, allowing them to filter
or refine the information from one Broker to the next.

<sect>Installing the Harvest Software
<label id="Installing the Harvest Software">

<p>

<sect1>Requirements for Harvest Servers

<p>

<sect2>Hardware

<p>
A good machine for running a typical Harvest server will have a reasonably
fast processor, 1-2 GB of free disk space, and 128 MB of RAM.
A slower CPU will work, but it will slow down the Harvest server.  More
important than CPU speed, however, is memory size.  Harvest uses a number of
processes, some of which provide needed ``plumbing'' (e.g.,
<tt>search.cgi</tt>), and some of which improve performance (e.g., the
<tt>glimpseserver</tt> process).  If you do not have enough memory, your
system will page too much, and performance will drop drastically.  The other
factor affecting RAM usage is how much data you are trying to index in a
Harvest Broker.  The more data, the more disk I/O will be performed at query
time, and the more RAM it will take to provide a reasonably sized disk buffer
pool.

The amount of disk space you will need depends on how much data you want to
index in a single Broker.  (It is possible to distribute your index over
multiple Brokers if it gets too large for one disk.)  A good rule of
thumb is that the Gatherer and Broker databases need about 10% as much disk
space as the total size of the data you want to index.
The actual space needs will vary depending on the type of data you are
indexing.  For example, PostScript achieves a much higher indexing space
reduction than HTML, because so much of the PostScript data (such as
page positioning information) is discarded when building the index.
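
As a rough illustration of this rule of thumb, the following sketch
measures a data directory in kilobytes and divides the result by ten.
The path <em>/export/ftp</em> and the size reported by <tt>du</tt> are
hypothetical; treat the result only as an estimate, since the actual
ratio depends on the type of data.

<tscreen><verb>
        % du -sk /export/ftp
        5242880 /export/ftp
        % expr 5242880 / 10
        524288
</verb></tscreen>

In this made-up case, about 5 GB of data would call for roughly 512 MB
of disk space for the Gatherer and Broker databases.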

<sect2>Platforms

<p>
To run a Harvest server, you need a UNIX-like operating system.

<sect2>Software

<p>
To use Harvest, you need the following software packages:

<itemize>
<item>All Harvest servers require: Perl v5.0 or higher.
<item>The Harvest Broker and Gatherer require: GNU <tt>gzip</tt> v1.2.4 or
      higher.
<item>The Harvest Broker requires: an HTTP server.
</itemize>

To build Harvest from the source distribution you may need to install one
or more of the following software packages:

<itemize>
<item>Compiling Harvest requires: GNU <tt>gcc</tt> v2.5.8 or higher.
<item>Compiling the Harvest Broker requires: <tt>flex</tt> v2.4.7 or higher and
     <tt>bison</tt> v1.22 or higher.
</itemize>

The sources for <tt>gcc</tt>, <tt>gzip</tt>, <tt>flex</tt>, and <tt>bison</tt>
are available at the <url url="ftp://ftp.gnu.org/" name="GNU FTP server">.

<sect1>Requirements for Harvest Users

<p>
Anyone with a web browser (e.g., Internet Explorer, Lynx, Mozilla,
Netscape, Opera, etc.) can access and use Harvest servers.

<sect1>Retrieving and Installing the Harvest Software

<p>

<sect2>Distribution types

<p>
Currently we offer only a source distribution of Harvest.  The <em>source
distribution</em> contains all of the source code for the Harvest software.
There are no <em>binary distributions</em> of Harvest.

You can retrieve the Harvest source distributions from the Harvest download
site <htmlurl url="http://prdownloads.sourceforge.net/harvest/"
name="prdownloads.sourceforge.net/harvest/">.

<sect2>Harvest components

<p>
Harvest components are in the <em>components</em> directory.  To use a
component, follow the instructions included in the desired component
directory.

<sect2>User-contributed software

<p>
There is a collection of unsupported user-contributed software in the
<em>contrib</em> directory.  If you would like to contribute some software,
please send email to <url url="mailto:lee@arco.de" name="lee@arco.de">.

<sect1>Building the Source Distribution

<p>
The source distribution can be extracted in any directory.  The following
command will extract the gnu-zipped source archive:

<tscreen><verb>
        % gzip -dc harvest-x.y.z.tar.gz | tar xf -
</verb></tscreen>

For archives compressed with bzip2, use:

<tscreen><verb>
        % bzip2 -dc harvest-x.y.z.tar.bz2 | tar xf -
</verb></tscreen>

Harvest uses GNU's <em>autoconf</em> package to perform needed configuration
at installation time.  If you want to override the default installation
location of <em>/usr/local/harvest</em>, change the ``prefix'' variable when
invoking ``configure''.  If desired, you may edit
<em>src/common/include/config.h</em> before compiling to change various
Harvest compile-time limits and variables.  To compile the source tree, type
<tt>make</tt>.

For example, to build and install the entire Harvest system into the
<em>/usr/local/harvest</em> directory, type:

<tscreen><verb>
        % ./configure
        % make
        % make install
</verb></tscreen>
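
To install somewhere other than the default location, pass the prefix
to <tt>configure</tt>.  The following is a minimal sketch that assumes
the standard autoconf prefix option and a hypothetical target directory
<em>/opt/harvest</em>:

<tscreen><verb>
        % ./configure --prefix=/opt/harvest
        % make
        % make install
</verb></tscreen>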

You may see some compiler warning messages, which you can ignore.

Building the entire Harvest distribution will take a few minutes on a
reasonably fast machine.  The compiled source tree takes approximately 25
megabytes of disk space.

Later, after the installed software is working, you can remove the compiled
code (``.o'' files) and other intermediate files by typing
<tt>make clean</tt>.  If you want to remove the configure-generated
Makefiles, type <tt>make distclean</tt>.

<sect1>Additional installation for the Harvest Broker
<label id="Additional installation for the Harvest Broker">

<p>

<sect2>Checking the installation for HTTP access

<p>
The Broker interacts with your HTTP server in a number of ways.
You should make sure that the HTTP server can properly access
the files it needs.  In many cases, the HTTP server will run
under a different userid than the owner of the Harvest files.

First, make sure the HTTP server userid can read the <em>query.html</em>
files in each broker directory.  Second, make sure the HTTP server
userid can access and execute the CGI programs in
<em>$HARVEST_HOME/cgi-bin/</em>.  The <tt>search.cgi</tt> script reads
files from the <em>$HARVEST_HOME/cgi-bin/lib/</em> directory, so check
that as well.  Finally, check the files in <em>$HARVEST_HOME/lib/</em>.
Some of the CGI Perl scripts require ``include'' files in this directory.
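
One way to spot permission problems is to list the relevant files and
directories and compare their owners and modes with the userid of your
HTTP server.  This is only a sketch; it assumes that
<em>$HARVEST_HOME</em> is set in your shell and that your Brokers live
under <em>$HARVEST_HOME/brokers</em>:

<tscreen><verb>
        % ls -l $HARVEST_HOME/brokers/*/query.html
        % ls -ld $HARVEST_HOME/cgi-bin $HARVEST_HOME/cgi-bin/lib $HARVEST_HOME/lib
        % ls -l $HARVEST_HOME/cgi-bin
</verb></tscreen>

Files should be readable, and the CGI programs also executable, by the
userid under which the HTTP server runs.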

<sect2>Required modifications to your HTTP server

<p>
The Harvest Broker requires that an HTTP server is running, and that the
HTTP server ``knows'' about the Broker's files.  Below are some examples
of how to configure various HTTP servers to work with the Harvest
Broker.

<sect2>Apache httpd

<p>
Apache httpd requires a <bf>ScriptAlias</bf> and an <bf>Alias</bf> entry in
<em>httpd.conf</em>, e.g.:

<tscreen><verb>
        ScriptAlias /Harvest/cgi-bin/ Your-HARVEST_HOME/cgi-bin/
        Alias /Harvest/ Your-HARVEST_HOME/
</verb></tscreen>

<em>WARNING:</em> The <bf>ScriptAlias</bf> entry must appear
<em>before</em> the <bf>Alias</bf> entry.

Additionally, it might be necessary to configure Apache httpd to follow
<em>symbolic links</em>.  To do this, add the following to your
<em>httpd.conf</em>:

<tscreen><verb>
        &lt;Directory Your-HARVEST_HOME&gt;
                Options FollowSymLinks
        &lt;/Directory&gt;
</verb></tscreen>

<sect2>Other HTTP servers

<p>
Install the HTTP server and modify its configuration file so that the
<em>/Harvest</em> directory points to <em>$HARVEST_HOME</em>.  You will also
need to configure your HTTP server so that it knows that the directory
<em>/Harvest/cgi-bin</em> contains valid CGI programs.  If the default
behaviour of your HTTP server is not to follow symbolic links, you will need
to configure it so that it will follow symbolic links in the <em>/Harvest</em>
directory.

<sect1>Upgrading versions of the Harvest software

<p>

<sect2>Upgrading from version 1.6 to version 1.8

<p>
You <em>cannot</em> install version 1.8 on top of version 1.6.  For example,
the change from version 1.6 to version 1.8 included some reorganization of
the executables, and hence simply installing version 1.8 on top of
version 1.6 would cause you to use old executables in some cases.

To upgrade from Harvest version 1.6 to 1.8, do:

<enum>
<item>Move your old installation to a temporary location.
<item>Install the new version as directed by the release notes.
<item>Then, for each Gatherer and Broker that you were running under
      the old installation, migrate the server into the new installation.

      <descrip>
      <tag/Gatherers:/
      you need to move the Gatherer's directory into
      <em>$HARVEST_HOME/gatherers</em>.  Section
      <ref id="RootNode specifications"
      name="RootNode specifications">
      describes the Gatherer workload specifications if you want to
      modify your Gatherer's configuration file.

      <tag/Brokers:/
      rebuild your broker by using <tt>CreateBroker</tt> and merge in
      any customizations you have made to your old Broker.
      </descrip>

</enum>
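
For a single Gatherer, the steps above can be sketched as follows.  The
directory name <em>/usr/local/harvest-1.6</em> and the Gatherer name
<em>sample</em> are hypothetical, and the sketch assumes that the old
Gatherer lived in the <em>gatherers</em> subdirectory of the old
installation:

<tscreen><verb>
        % mv /usr/local/harvest /usr/local/harvest-1.6
          (install version 1.8 into /usr/local/harvest)
        % mv /usr/local/harvest-1.6/gatherers/sample /usr/local/harvest/gatherers/
</verb></tscreen>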

<sect2>Upgrading from version 1.5 to version 1.6

<p>
There are no known incompatibilities between versions 1.5 and 1.6.

<sect2>Upgrading from version 1.4 to version 1.5

<p>
You <em>cannot</em> install version 1.5 on top of version 1.4.  For example,
the change from version 1.4 to version 1.5 included some reorganization of
the executables, and hence simply installing version 1.5 on top of
version 1.4 would cause you to use old executables in some cases.

To upgrade from Harvest version 1.4 to 1.5, do:

<enum>
<item>Move your old installation to a temporary location.
<item>Install the new version as directed by the release notes.
<item>Then, for each Gatherer and Broker that you were running under
      the old installation, migrate the server into the new installation.

      <descrip>
      <tag/Gatherers:/
      you need to move the Gatherer's directory into
      <em>$HARVEST_HOME/gatherers</em>.  Section
      <ref id="RootNode specifications"
      name="RootNode specifications"> describes the
      Gatherer workload specifications if you want to modify your
      Gatherer's configuration file.

      <tag/Brokers:/
      you need to move the Broker's directory into
      <em>$HARVEST_HOME/brokers</em>. Remove any <em>.glimpse_*</em>
      files from your Broker's directory and use the
      <em>admin.html</em> interface to force a full-index.  You may
      want, however, to rebuild your broker by using
      <tt>CreateBroker</tt> so that you can use the updated
      <em>query.html</em> and related files.
      </descrip>

</enum>

<sect2>Upgrading from version 1.3 to version 1.4

<p>
There are no known incompatibilities between versions 1.3 and 1.4.

<sect2>Upgrading from version 1.2 to version 1.3

<p>
Version 1.3 is mostly backwards compatible with 1.2, with the following
exception:

Harvest 1.3 uses Glimpse 3.0.  The <em>.glimpse_*</em> files
in the broker directory created with Harvest 1.2 (Glimpse 2.0)
are incompatible.  After installing Harvest 1.3 you should:

<enum>
<item>Shut down any running brokers.
<item>Execute <tt>rm .glimpse_*</tt> in each broker directory.
<item>Restart your brokers with <tt>RunBroker</tt>.
<item>Force a full-index from the <em>admin.html</em> interface.
</enum>
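
For a single Broker, steps 2 and 3 might look like the following
sketch, run after the Broker has been shut down.  The Broker directory
name <em>mybroker</em> is hypothetical; <tt>RunBroker</tt> is run from
the Broker's own directory:

<tscreen><verb>
        % cd $HARVEST_HOME/brokers/mybroker
        % rm .glimpse_*
        % ./RunBroker
</verb></tscreen>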

<sect2>Upgrading from version 1.1 to version 1.2

<p>
There are a few incompatibilities between Harvest version 1.1
and version 1.2.

<itemize>
<item>The Gatherer has improved incremental gathering support which
      is incompatible with version 1.1.  To update your existing
      Gatherer, rebuild the index in the Gatherer's
      <em>Data-Directory</em> (usually the <em>data</em> subdirectory)
      by running the following commands from the Gatherer's directory:

      <tscreen><verb>
        % set path = ($HARVEST_HOME/lib/gatherer $path)
        % cd data
        % rm -f INDEX.gdbm
        % mkindex
      </verb></tscreen>

      This should create the <em>INDEX.gdbm</em> and <em>MD5.gdbm</em>
      files in the current directory.
<item>The Broker has a new log format for the <em>admin/LOG</em>
      file which is incompatible with version 1.1.
</itemize>

<sect2>Upgrading to version 1.1 from version 1.0 or older

<p>
If you already have an older version of Harvest installed, and want to
upgrade, you <em>cannot</em> unpack the new distribution on top of the old
one.  For example, the change from version 1.0 to version 1.1 included
some reorganization of the executables, and hence simply installing
version 1.1 on top of version 1.0 would cause you to use old executables
in some cases.

On the other hand, you may not want to start over from scratch with a new
software version, as that would not take advantage of the data you have
already gathered and indexed.  Instead, to upgrade from Harvest version 1.0 to
1.1, do the following:

<enum>
<item>Move your old installation to a temporary location.
<item>Install the new version as directed by the release notes.
<item>Then, for each Gatherer and Broker that you were running under
      the old installation, migrate the server into the new
      installation.

      <descrip>
      <tag/Gatherers:/
      you need to move the Gatherer's directory into
      <em>$HARVEST_HOME/gatherers</em>.  Section
      <ref id="RootNode specifications"
      name="RootNode specifications"> describes the
      new Gatherer workload specifications which were introduced in
      version 1.1; you may modify your Gatherer's configuration file
      to employ this new functionality.

      <tag/Brokers:/
      you need to move the Broker's directory into
      <em>$HARVEST_HOME/brokers</em>.  You may want, however, to
      rebuild your broker by using <tt>CreateBroker</tt> so that you
      can use the updated <em>query.html</em> and related files.
      </descrip>

</enum>

<sect1>Starting up the system: RunHarvest and related commands
<label id="Starting up the system: RunHarvest and related commands">

<p>
The simplest way to start the Harvest system is to use the <tt>RunHarvest</tt>
command.  <tt>RunHarvest</tt> prompts the user with a short list of questions
about what data to index, etc., and then creates and runs a Gatherer and
Broker with a ``stock'' (non-customized) set of content extraction and
indexing mechanisms.  Some more primitive commands are also available, for
starting individual Gatherers and Brokers (e.g., if you want to distribute
the gathering process).  The Harvest startup commands are:

<descrip>
<tag/RunHarvest/
Checks that the Harvest software is installed correctly, prompts the user for
basic configuration information, and then creates and runs a Gatherer and
a Broker.  If you have <em>$HARVEST_HOME</em> set, then it will use it;
otherwise, it tries to determine <em>$HARVEST_HOME</em> automatically.  Found
in the <em>$HARVEST_HOME</em> directory.

<tag/RunBroker/
Runs a Broker.  Found in the Broker's directory.

<tag/RunGatherer/
Runs a Gatherer.  Found in the Gatherer's directory.

<tag/CreateBroker/
Creates a single Broker which will collect its information from other existing
Brokers or Gatherers.  Used by <tt>RunHarvest</tt>, or can be run by a user to
create a new Broker.  Uses <em>$HARVEST_HOME</em>, and defaults to
<em>/usr/local/harvest</em>.  Found in the <em>$HARVEST_HOME/bin</em>
directory.
</descrip>

There is no <tt>CreateGatherer</tt> command, but the <tt>RunHarvest</tt>
command can create a Gatherer, or you can create a Gatherer manually (see
Section <ref id="Customizing the type recognition, candidate
selection, presentation unnesting, and summarizing steps"
name="Customizing the type recognition, candidate selection,
presentation unnesting, and summarizing steps"> or Section
<ref id="Gatherer Examples" name="Gatherer Examples">).  The layout of
the installed Harvest directories and programs is discussed in Section
<ref id="Programs and layout of the installed Harvest software"
name="Programs and layout of the installed Harvest software">.

Among other things, the <tt>RunHarvest</tt> command asks the user what port
numbers to use when running the Gatherer and the Broker.  By default,
the Gatherer will use port 8500 and the Broker will use the Gatherer
port plus 1.  The choice of port numbers depends on your particular
machine -- you need to choose ports that are not in use by other servers
on your machine.  You might look at your <em>/etc/services</em> file to see
what ports are in use (although this file only lists some servers; other
servers use ports without registering that information anywhere).
Usually the above port numbers will not be in use by other processes.
Probably the easiest thing is simply to try using the default port
numbers, and see if it works.
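
A quick check before picking ports is to search
<em>/etc/services</em> and the list of open sockets for the default
numbers.  This sketch assumes that your system provides
<tt>netstat</tt>.  No output from either command suggests, but does not
prove, that the ports are free:

<tscreen><verb>
        % egrep '8500|8501' /etc/services
        % netstat -an | egrep '8500|8501'
</verb></tscreen>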

The remainder of this manual provides information for users who wish to
customize or otherwise make more sophisticated use of Harvest than what
happens when you install the system and run <tt>RunHarvest</tt>.

<sect1>Harvest team contact information

<p>
If you have questions about the Harvest system or problems with the
software, post a note to the USENET newsgroup
<url url="news:comp.infosystems.harvest" name="comp.infosystems.harvest">.
Please note your machine type, operating system type, and Harvest version
number in your correspondence.

If you have bug fixes, ports to new platforms or other software
improvements, please email them to the Harvest maintainer
<url url="mailto:lee@arco.de" name="lee@arco.de">.

<sect>The Gatherer
<label id="The Gatherer">

<p>

<sect1>Overview

<p>
The Gatherer retrieves information resources using a variety of
standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and
then summarizes those resources in various type-specific ways to generate
structured indexing information.  For example, a Gatherer can retrieve a
technical report from an FTP archive, and then extract the
author, title, and abstract from the paper to summarize the technical
report.  Harvest Brokers or other search services can then retrieve the
indexing information from the Gatherer to use in a searchable index
available via a WWW interface.

The Gatherer consists of a number of separate components.
The <tt>Gatherer</tt> program reads a Gatherer configuration file
and controls the overall process of enumerating and summarizing
data objects.

The structured indexing information that the Gatherer collects is
represented as a list of attribute-value pairs using the <em>Summary Object
Interchange Format</em> (SOIF, see Section <ref id="The Summary Object
Interchange Format (SOIF)" name="The Summary Object Interchange Format
(SOIF)">).  The <tt>gatherd</tt> daemon serves the Gatherer database to Brokers.
It keeps running in the background after a gathering session is complete.
A stand-alone <tt>gather</tt> program is a client for the <tt>gatherd</tt>
server.  It can be used from the command line for testing, and is used
by the Broker.  The Gatherer uses a local disk cache to store objects it
has retrieved.  The disk cache is described in Section
<ref id="The local disk cache" name="The local disk cache">.

Even though the <tt>gatherd</tt> daemon remains in the background, a
Gatherer does not automatically update or refresh its summary objects.
Each object in a Gatherer has a Time-to-Live value.  Objects remain
in the database until they expire.  See Section <ref id="Periodic
gathering and realtime updates" name="Periodic gathering and realtime
updates"> for more information on keeping Gatherer objects up to date.

Several example Gatherers are provided with the Harvest software
distribution (see Section <ref id="Gatherer Examples" name="Gatherer
Examples">).

<sect1>Basic setup
<label id="Basic setup">

<p>
To run a basic Gatherer, you need only list the Uniform Resource Locators
(URLs, see <htmlurl url="http://www.ietf.org/rfc/rfc1630.txt" name="RFC1630"> and
<htmlurl url="http://www.ietf.org/rfc/rfc1738.txt" name="RFC1738">) from which
it will gather indexing information.  This list is specified in the
Gatherer configuration file, along with other optional information
such as the Gatherer's name and the directory in which it resides (see
Section <ref id="Setting variables in the Gatherer configuration file"
name="Setting variables in the Gatherer configuration file"> for
details on the optional information).  Below is an example Gatherer
configuration file:

<tscreen><verb>
        #
        #  sample.cf - Sample Gatherer Configuration File
        #
        Gatherer-Name:    My Sample Harvest Gatherer
        Gatherer-Port:    8500
        Top-Directory:    /usr/local/harvest/gatherers/sample

        &lt;RootNodes&gt;
        # Enter URLs for RootNodes here
        http://www.mozilla.org/
        http://www.xfree86.org/
        &lt;/RootNodes&gt;

        &lt;LeafNodes&gt;
        # Enter URLs for LeafNodes here
        http://www.arco.de/~kj/index.html
        &lt;/LeafNodes&gt;
</verb></tscreen>

As shown in the example configuration file, you may classify a URL as a
<bf>RootNode</bf> or a <bf>LeafNode</bf>.  For a LeafNode URL, the Gatherer
simply retrieves the URL and processes it.  LeafNode URLs are typically
files like PostScript papers or compressed ``tar'' distributions.  For a
RootNode URL, the Gatherer will expand it into zero or more LeafNode URLs
by recursively enumerating it in an access method-specific way.  For FTP or
Gopher, the Gatherer will perform a recursive directory listing on the
FTP or Gopher server to expand the RootNode (typically a directory
name).  For HTTP, a RootNode URL is expanded by following the
embedded HTML links to other URLs.  For News, the enumeration returns
all the messages in the specified USENET newsgroup.

PLEASE BE CAREFUL when specifying RootNodes as it is possible to
specify an enormous amount of work with a single RootNode URL.  To help
prevent a misconfigured Gatherer from abusing servers or running wildly,
by default the Gatherer will only expand a RootNode into 250 LeafNodes,
and will only include HTML links that point to documents that reside on
the same server as the original RootNode URL.  There are several options
that allow you to change these limits and otherwise enhance the Gatherer
specification.  See Section <ref id="RootNode specifications"
name="RootNode specifications"> for details.

The Gatherer is a
<htmlurl url="http://www.robotstxt.org/wc/robots.html"
name="``robot''"> and collects URLs starting from the URLs specified
in RootNodes.  It obeys the <em>robots.txt</em> convention and the
<em>robots META tag</em>.  It is also
<htmlurl url="http://www.ietf.org/rfc/rfc2616.txt" name="HTTP Version 1.1">
compliant and sends the <em>User-Agent</em> and <em>From</em> request
fields to HTTP servers for accountability.

After you have written the Gatherer configuration file, create a directory for
the Gatherer and copy the configuration file there.  Then, run the
<tt>Gatherer</tt> program with the configuration file as the only command-line
argument, as shown below:

<tscreen><verb>
        % Gatherer GathName.cf
</verb></tscreen>
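
Put together, these steps might look like the following sketch.  It
uses the <em>sample.cf</em> file shown earlier, whose
<em>Top-Directory</em> names the Gatherer's directory, and it assumes
that the <tt>Gatherer</tt> program is on your path:

<tscreen><verb>
        % mkdir /usr/local/harvest/gatherers/sample
        % cp sample.cf /usr/local/harvest/gatherers/sample
        % cd /usr/local/harvest/gatherers/sample
        % Gatherer sample.cf
</verb></tscreen>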

The Gatherer will generate a database of the content summaries, a log file
(<em>log.gatherer</em>), and an error log file (<em>log.errors</em>).  It will
also start the <tt>gatherd</tt> daemon which exports the indexing information
automatically to Brokers and other clients.  To view the exported indexing
information, you can use the <tt>gather</tt> client program, as shown below:

<tscreen><verb>
        % gather localhost 8500 | more
</verb></tscreen>

The <tt>gather</tt> client accepts a few optional arguments.
The <bf>-info</bf> option causes the Gatherer to respond only with the
Gatherer summary information, which consists of the attributes available in
the specified Gatherer's database, the Gatherer's host and name, the range of
object update times, and the number of objects.  Compression is the default,
but can be disabled with the <bf>-nocompress</bf> option.  An optional
timestamp argument tells the Gatherer to send only the objects that have
changed since the specified timestamp (in seconds since the UNIX ``epoch''
of January 1, 1970).
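
The following invocations sketch how these arguments might be used.
The placement of the options before the host name and of the timestamp
after the port number is an assumption; check the usage message of your
<tt>gather</tt> program.  The timestamp 1035849600 corresponds to
2002-10-29 00:00:00 UTC.

<tscreen><verb>
        % gather -info localhost 8500
        % gather -nocompress localhost 8500 | more
        % gather localhost 8500 1035849600 | more
</verb></tscreen>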

<sect2>Gathering News URLs with NNTP

<p>
News URLs differ somewhat from URLs for the other access protocols because
the URL generally does not contain a hostname.  The Gatherer retrieves
News URLs from an NNTP server.  The name of this server must be placed
in the environment variable <em>$NNTPSERVER</em>.  It is probably a good
idea to add this to your <tt>RunGatherer</tt> script.  If the environment
variable is not set, the Gatherer attempts to connect to a host named
<em>news</em> at your site.
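
For instance, with a hypothetical news server called
<em>news.example.com</em>, the variable could be set as follows before
starting the Gatherer.  The first form is for csh-style shells, the
second for Bourne-style shells:

<tscreen><verb>
        % setenv NNTPSERVER news.example.com

        $ NNTPSERVER=news.example.com; export NNTPSERVER
</verb></tscreen>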

<sect2>Cleaning out a Gatherer

<p>
Remember that the Gatherer databases persist between runs.  Objects
remain in the databases until they expire.  When experimenting
with the Gatherer, it is always a good idea to ``clean out''
the databases between runs.  This is most easily accomplished
by executing this command from the Gatherer directory:

<tscreen><verb>
        % rm -rf data tmp log.*
</verb></tscreen>

<sect1>RootNode specifications
<label id="RootNode specifications">

<p>
The RootNode specification facility described in Section
<ref id="Basic setup" name="Basic setup"> provides a basic set of
default enumeration actions for RootNodes.  Often it is useful to
enumerate beyond the default limits, for example, to increase the
enumeration limit beyond 250 URLs, or to allow site boundaries to be
crossed when enumerating HTML links.  It is possible to specify these
and other aspects of enumeration, using the following syntax:

<tscreen><verb>
        &lt;RootNodes&gt;
        URL EnumSpec
        URL EnumSpec
        ...
        &lt;/RootNodes&gt;
</verb></tscreen>

where <em>EnumSpec</em> is on a single line (using ``<bf>\</bf>'' to
escape linefeeds), with the following syntax:

<tscreen><verb>
        URL=URL-Max[,URL-Filter-filename]  \
        Host=Host-Max[,Host-Filter-filename] \
        Access=TypeList \
        Delay=Seconds \
        Depth=Number \
        Enumeration=Enumeration-Program
</verb></tscreen>
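
As an illustration of this syntax, the following sketch raises the URL
limit and sets a depth for one RootNode, and restricts another by host
count, access methods, and delay.  The URLs and the filter file name
<em>MyFilter</em> are hypothetical:

<tscreen><verb>
        &lt;RootNodes&gt;
        http://www.example.com/ URL=1000,MyFilter Depth=3
        http://docs.example.com/ Host=5 Access=HTTP|FTP Delay=60
        &lt;/RootNodes&gt;
</verb></tscreen>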

The <em>EnumSpec</em> modifiers are all optional, and have the following
meanings:

<descrip>
<tag/URL-Max/
The number specified on the right-hand side of the ``URL='' expression
gives the maximum number of LeafNode URLs to generate at all levels of
depth, from the current URL.  Note that <em>URL-Max</em> is the maximum
number of URLs that are generated during the enumeration, and <em>not</em>
a limit on how many URLs can pass through the candidate selection phase
(see Section <ref id="Customizing the candidate selection step"
name="Customizing the candidate selection step">).

<tag/URL-Filter-filename/
This is the name of a file containing a set of regular expression filters
(see Section <ref id="RootNode filters" name="RootNode filters">) to
allow or deny particular LeafNodes in the enumeration.  The default filter is
<em>$HARVEST_HOME/lib/gatherer/URL-filter-default</em> which excludes
many image and sound files.

<tag/Host-Max/
The number specified on the right hand side of the ``Host='' expression
gives the maximum number of hosts that will be touched during the
RootNode enumeration.  This enumeration actually counts hosts by IP
address so that aliased hosts are properly enumerated.  Note that this
does not work correctly for multi-homed hosts, or for hosts with
rotating DNS entries (used by some sites for load balancing heavily
accessed servers).

<em>Note:</em> Prior to Harvest Version 1.2 the ``Host=...'' line was
called ``Site=...''.  We changed the name to ``Host='' because it is
more intuitively meaningful (being a host count limit, not a site count
limit).  For backwards compatibility with older Gatherer configuration
files, we will continue to treat ``Site='' as an alias for ``Host=''.

<tag/Host-Filter-filename/
This is the name of a file containing a set of regular expression
filters to allow or deny particular hosts in the enumeration.  Each
expression can specify both a host name (or IP address) and a port
number (in case you have multiple servers running on different ports of
the same server and you want to index only one).  The syntax is
``hostname:port''.

<tag/Access/
If the RootNode is an HTTP URL, then you can specify the access methods
across which to enumerate.  Valid access method types are: <bf>FILE, FTP,
Gopher, HTTP, News, Telnet,</bf> or <bf>WAIS</bf>.  Use a ``<bf>|</bf>''
character between type names to allow multiple access methods.  For example,
``<bf>Access=HTTP|FTP|Gopher</bf>'' will follow HTTP, FTP, and Gopher
URLs while enumerating an HTTP RootNode URL.

<em>Note:</em> We do not support cross-method enumeration from Gopher,
because of the difficulty of ensuring that Gopher pointers do not cross
site boundaries.  For example, the Gopher URL
<em>gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/</em>
would get an FTP directory listing of ftp.cs.washington.edu:/pub, even though
the host part of the URL is powell.cs.colorado.edu.

<tag/Delay/
This is the number of seconds to wait between server contacts.  It
defaults to one second, when not specified otherwise.  <bf>Delay=3</bf>
will let the gatherer sleep 3 seconds between server contacts.

<tag/Depth/
This is the maximum number of levels of enumeration that will be followed
during gathering.  <bf>Depth=0</bf> means that there is <em>no</em> limit
to the depth of the enumeration.  <bf>Depth=1</bf> means the specified URL
will be retrieved, and all the URLs referenced by the specified URL will be
retrieved; and so on for higher Depth values.  In other words, the
enumeration will follow links up to <em>Depth</em> steps away from the
specified URL.

<tag/Enumeration-Program/
This modifier adds a very flexible way to control a Gatherer.  The
Enumeration-Program is a filter which reads URLs as input and writes new
enumeration parameters on output.  See section <ref id="Generic
Enumeration program description" name="Generic Enumeration program
description"> for specific details.
</descrip>

When not specified, <em>URL-Max</em> defaults to 250, <em>URL-Filter</em> defaults
to no limit, <em>Host-Max</em> defaults to 1, <em>Host-Filter</em> defaults
to no limit, <em>Access</em> defaults to HTTP only, <em>Delay</em> defaults
to 1 second, and <em>Depth</em> defaults to zero (no limit).  There is no way to specify
an unlimited value for <em>URL-Max</em> or <em>Host-Max</em>.
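
As a sketch of how these modifiers combine, the following RootNode
entry raises the URL limit, lets the enumeration touch a few additional
hosts, and bounds the depth.  The URL and all of the numbers are
hypothetical values chosen only for illustration:

<tscreen><verb>
        &lt;RootNodes&gt;
        http://www.example.com/ \
                URL=1000 Host=5 Depth=3 Delay=2
        &lt;/RootNodes&gt;
</verb></tscreen>

Modifiers that are omitted (here the two filter files, <em>Access</em>
and <em>Enumeration</em>) keep the defaults listed above.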

<sect2>RootNode filters
<label id="RootNode filters">

<p>
Filter files use the standard UNIX regular expression syntax (as defined
by the POSIX standard), not the csh ``globbing'' syntax.  For example,
you would use ``.*abc'' to indicate any string ending with ``abc'', not
``*abc''.  A filter file has the following syntax:

<tscreen><verb>
        Deny  regex
        Allow regex
</verb></tscreen>

The <em>URL-Filter</em> regular expressions are matched only on the
URL-path portion of each URL (the scheme, hostname and port are
excluded).  For example, the following URL-Filter file would allow all
URLs except those containing the regular expression
``<em>/gatherers/</em>'':

<tscreen><verb>
        Deny  /gatherers/
        Allow .
</verb></tscreen>

Another common use of URL-filters is to prevent the Gatherer from
travelling ``up'' a directory.  Automatically generated HTML pages
for HTTP and FTP directories often contain a link for
the parent directory ``<em>..</em>''.  To keep the gatherer
below a specific directory, use a URL-filter file such as:

<tscreen><verb>
        Allow ^/my/cool/stuff/
        Deny  .
</verb></tscreen>

The <em>Host-Filter</em> regular expressions are matched on the
``hostname:port'' portion of each URL.  Because the port
is included, you cannot use ``<bf>$</bf>'' to anchor the
end of a hostname.  Beginning with version 1.3, IP addresses
may be specified in place of hostnames.  A class B address
such as 128.138.0.0 would be written as ``<bf>^128\.138\..*</bf>''
in regular expression syntax.  For example:

<tscreen><verb>
        Deny   bcn.boulder.co.us:8080
        Deny   bvsd.k12.co.us
        Allow  ^128\.138\..*
        Deny   .
</verb></tscreen>

The order of the <bf>Allow</bf> and <bf>Deny</bf> entries is important,
since the filters are applied sequentially from first to last.  So,
for example, if you list ``<bf>Allow .*</bf>'' first, no subsequent
<bf>Deny</bf> expressions will be used, since this <bf>Allow</bf> filter
will allow all entries.
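
To illustrate, consider two URL-Filter files that contain the same
entries in a different order (the ``<em>/private/</em>'' path is a
hypothetical example).  In the first file the <bf>Deny</bf> entry is
consulted before the catch-all <bf>Allow</bf>, so URLs containing
``<em>/private/</em>'' are excluded:

<tscreen><verb>
        Deny  /private/
        Allow .
</verb></tscreen>

In the second file the catch-all <bf>Allow</bf> matches every URL first,
so the <bf>Deny</bf> entry never takes effect:

<tscreen><verb>
        Allow .
        Deny  /private/
</verb></tscreen>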

<sect2>Generic Enumeration program description
<label id="Generic Enumeration program description">

<p>
Flexible enumeration can be achieved by giving an
<bf>Enumeration=Enumeration-Program</bf> modifier to a RootNode URL.  The
<em>Enumeration-Program</em> is a filter which takes URLs on standard input
and writes new RootNode URLs on standard output.

The output format differs from the syntax used to specify a RootNode URL in a
Gatherer configuration file.  Each output line must have nine fields
separated by spaces.  These fields are:

<tscreen><verb>
        URL
        URL-Max
        URL-Filter-filename
        Host-Max
        Host-Filter-filename
        Access
        Delay
        Depth
        Enumeration-Program
</verb></tscreen>

These are the same fields as described in section
<ref id="RootNode specifications" name="RootNode specifications">.
Values must be given for each field.  Use <em>/dev/null</em> to disable
the URL-Filter-filename and Host-Filter-filename.  Use <tt>/bin/false</tt>
to disable the Enumeration-Program.
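
For illustration, a single line written by an Enumeration-Program might
look as follows.  The URL and the numeric values are hypothetical; the
fields appear in the order listed above, with both filter files and any
further enumeration program disabled:

<tscreen><verb>
        http://www.example.com/docs/ 100 /dev/null 1 /dev/null HTTP 1 2 /bin/false
</verb></tscreen>

Read field by field, this requests at most 100 URLs from a single host,
following only HTTP links, with a one second delay and a depth of 2.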

<sect2>Example RootNode configuration

<p>
Below is an example RootNode configuration:

<tscreen><verb>
        &lt;RootNodes&gt;
  (1)   http://harvest.cs.colorado.edu/               URL=100,MyFilter
  (2)   http://www.cs.colorado.edu/                   Host=50 Delay=60
  (3)   gopher://gopher.colorado.edu/                 Depth=1
  (4)   file://powell.cs.colorado.edu/home/hardy/     Depth=2
  (5)   ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
  (6)   http://harvest.cs.colorado.edu/~hardy/hotlist.html \
                Depth=1 Delay=60
  (7)   http://harvest.cs.colorado.edu/~hardy/ \
                Depth=2 Access=HTTP|FTP
        &lt;/RootNodes&gt;
</verb></tscreen>

Each of the above RootNodes follows a different enumeration configuration as
follows:

<enum>
<item>This RootNode will gather up to 100 documents that pass through
      the URL name filters contained within the file <em>MyFilter</em>.
<item>This RootNode will gather the documents from up to the first 50
      hosts it encounters while enumerating the specified URL, with no
      limit on the Depth of link enumeration.  It will also wait for
      60 seconds between each retrieval.
<item>This RootNode will gather only the documents from the top-level
      menu of the Gopher server at <em>gopher.colorado.edu</em>.
<item>This RootNode will gather all documents that are in the
      <em>/home/hardy</em> directory, or that are in any subdirectory
      of <em>/home/hardy</em>.
<item>This RootNode will gather only the documents that are in the
      <em>/pub/cs/techreports</em> directory which, in this case, contains
      bibliographic files rather than the technical reports themselves.
<item>This RootNode will gather all documents that are within 1 step
      away from the specified RootNode URL, waiting 60 seconds between
      each retrieval.  This is a good method by which to index your
      hotlist.  By putting an HTML file containing ``hotlist''
      pointers as this RootNode, this enumeration will gather the
      top-level pages to all of your hotlist pointers.
<item>This RootNode will gather all documents that are at most 2 steps
      away from the specified RootNode URL.  Furthermore, it will
      follow and enumerate any HTTP or FTP URLs that it encounters
      during enumeration.
</enum>

<sect2>Gatherer enumeration vs. candidate selection
<label id="Gatherer enumeration vs. candidate selection">

<p>
In addition to using the <em>URL-Filter</em> and <em>Host-Filter</em>
files for the RootNode specification mechanism described in Section
<ref id="RootNode specifications" name="RootNode specifications">, you
can prevent documents from being indexed by customizing the
<em>stoplist.cf</em> file, described in Section <ref id="Customizing
the type recognition, candidate selection, presentation unnesting, and
summarizing steps" name="Customizing the type recognition, candidate selection,
presentation unnesting, and summarizing steps">.  Since these mechanisms are
invoked at different times, they have different effects.  The
<em>URL-Filter</em> and <em>Host-Filter</em> mechanisms are invoked by the
Gatherer's ``RootNode'' enumeration programs.   Using these filters as stop
lists can prevent unwanted objects from being retrieved across the network.
This can dramatically reduce gathering time and network traffic.

The <em>stoplist.cf</em> file is used by the <em>Essence</em> content
extraction system (described in Section <ref id="Extracting data for
indexing: The Essence summarizing subsystem" name="Extracting data for
indexing: The Essence summarizing subsystem">) <em>after</em> the
objects are retrieved, to select which objects should be content
extracted and indexed.  This can be useful because Essence
provides a more powerful means of rejecting indexing candidates, in which
you can customize based not only on file naming conventions but also on file
contents (e.g., looking at strings at the beginning of a file or at UNIX
``magic'' numbers), and also by more sophisticated file-grouping schemes
(e.g., deciding not to extract contents from object code files for which
source code is available).

As an example of combining these mechanisms, suppose you want to index the
``.ps'' files linked into your WWW site.  You could do this by having a
<em>stoplist.cf</em> file that contains ``HTML'', and a RootNode
<em>URL-Filter</em> that contains:

<tscreen><verb>
        Allow \.html
        Allow \.ps
        Deny  .*
</verb></tscreen>

As a final note, independent of these customizations the Gatherer
attempts to avoid retrieving objects where possible, by using a local
disk cache of objects, and by using the HTTP ``If-Modified-Since''
request header.  The local disk cache is described in
Section <ref id="The local disk cache" name="The local disk cache">.

<sect1>Generating LeafNode/RootNode URLs from a program

<p>
It is possible to generate RootNode or LeafNode URLs automatically from
program output.  This might be useful when gathering a large number of
Usenet newsgroups, for example.  The program is specified inside the
RootNode or LeafNode section, preceded by a pipe symbol.

<tscreen><verb>
        &lt;LeafNodes&gt;
        |generate-news-urls.sh
        &lt;/LeafNodes&gt;
</verb></tscreen>

The script must output valid URLs, such as

<tscreen><verb>
        news:comp.unix.voodoo
        news:rec.pets.birds
        http://www.nlanr.net/
        ...
</verb></tscreen>

In the case of RootNode URLs, enumeration parameters can be given after the
program.

<tscreen><verb>
        &lt;RootNodes&gt;
        |my-fave-sites.pl Depth=1 URL=5000,url-filter
        &lt;/RootNodes&gt;
</verb></tscreen>

<sect1>Extracting data for indexing: The Essence summarizing subsystem
<label id="Extracting data for indexing: The Essence summarizing subsystem">

<p>
After the Gatherer retrieves a document, it passes the document through
a subsystem called <em>Essence</em> to extract indexing information.
Essence allows the Gatherer to collect indexing information easily from
a wide variety of information, using different techniques depending on the
type of data and the needs of the particular corpus being indexed.  In a
nutshell, Essence can determine the type of data pointed to by a URL (e.g.,
PostScript vs. HTML), ``unravel'' presentation nesting formats (such as
compressed ``tar'' files), select which types of data to index (e.g.,
don't index Audio files), and then apply a type-specific extraction
algorithm (called a <em>summarizer</em>) to the data to generate a content
summary.  Users can customize each of these aspects, but often this is
not necessary.  Harvest is distributed with a ``stock'' set of type
recognizers, presentation unnesters, candidate selectors, and
summarizers that work well for many applications.

Below we describe the stock summarizer set, the current components
distribution, and how users can customize summarizers to change how they
operate and add summarizers for new types of data.  If you develop a
summarizer that is likely to be useful to other users, please notify us
via email at <url url="mailto:lee@arco.de" name="lee@arco.de"> so we
may include it in our Harvest distribution.

<tscreen><verb>
Type            Summarizer Function
--------------------------------------------------------------------
Bibliographic   Extract author and titles
Binary          Extract meaningful strings and manual page summary
C, CHeader      Extract procedure names, included file names, and comments
Dvi             Invoke the Text summarizer on extracted ASCII text
FAQ, FullText, README
                Extract all words in file
Font            Extract comments
HTML            Extract anchors, hypertext links, and selected fields
LaTex           Parse selected LaTex fields (author, title, etc.)
Mail            Extract certain header fields
Makefile        Extract comments and target names
ManPage         Extract synopsis, author, title, etc., based on ``-man'' macros
News            Extract certain header fields
Object          Extract symbol table
Patch           Extract patched file names
Perl            Extract procedure names and comments
PostScript      Extract text in word processor-specific fashion, and pass
                through Text summarizer.
RCS, SCCS       Extract revision control summary
RTF             Up-convert to HTML and pass through HTML summarizer
SGML            Extract fields named in extraction table
ShellScript     Extract comments
SourceDistribution
                Extract full text of README file and comments from Makefile
                and source code files, and summarize any manual pages
SymbolicLink    Extract file name, owner, and date created
TeX             Invoke the Text summarizer on extracted ASCII text
Text            Extract first 100 lines plus first sentence of each
                remaining paragraph
Troff           Extract author, title, etc., based on ``-man'', ``-ms'',
                ``-me'' macro packages, or extract section headers and
                topic sentences.
Unrecognized    Extract file name, owner, and date created.
</verb></tscreen>

<sect2>Default actions of ``stock'' summarizers

<p>
The table in Section <ref id="Extracting data for indexing: The
Essence summarizing subsystem" name="Extracting data for indexing: The
Essence summarizing subsystem"> provides a brief reference for how
documents are summarized depending on their type.  These actions can
be customized, as discussed in Section <ref id="Customizing the type
recognition, candidate selection, presentation unnesting, and
summarizing steps" name="Customizing the type recognition, candidate
selection, presentation unnesting, and summarizing steps">.  Some
summarizers are implemented as UNIX programs while others are
expressed as regular expressions; see Section <ref id="Customizing the
summarizing step" name="Customizing the summarizing step"> or Section
<ref id="Example 4" name="Example 4"> for more information about how
to write a summarizer.

<sect2>Summarizing SGML data
<label id="Summarizing SGML data">

<p>
It is possible to summarize documents that conform to the Standard
Generalized Markup Language (SGML), for which you have a Document Type
Definition (DTD).  The World Wide Web's HyperText Markup Language (HTML)
is actually a particular application of SGML, with a corresponding DTD.
(In fact, the Harvest HTML summarizer can use the HTML DTD and our SGML
summarizing mechanism, which provides various advantages; see Section
<ref id="The SGML-based HTML summarizer" name="The SGML-based HTML
summarizer">.)  SGML is being used in an increasingly broad variety of
applications, for example as a format for storing data for a number of
physical sciences.  Because SGML allows documents to contain a good
deal of structure, Harvest can summarize SGML documents very
effectively.

The SGML summarizer (<tt>SGML.sum</tt>) uses the <tt>sgmls</tt> program by
James Clark to parse the SGML document.  The parser needs both a DTD for the
document and a Declaration file that describes the allowed character set.
The <tt>SGML.sum</tt> program uses a table that maps SGML tags to SOIF
attributes.

<sect3>Location of support files

<p>
SGML support files can be found in
<em>$HARVEST_HOME/lib/gatherer/sgmls-lib/</em>.  For example, these are the
default pathnames for HTML summarizing using the SGML summarizing mechanism:

<tscreen><verb>
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl
</verb></tscreen>

The location of the DTD file must be specified in the <tt>sgmls</tt> catalog
(<em>$HARVEST_HOME/lib/gatherer/sgmls-lib/catalog</em>).  For example:

<tscreen><verb>
        DOCTYPE   HTML   HTML/html.dtd
</verb></tscreen>

The <tt>SGML.sum</tt> program looks for the <em>.decl</em> file in the default
location.  An alternate pathname can be specified with the <bf>-d</bf> option to
<tt>SGML.sum</tt>.

The summarizer looks for the <em>.sum.tbl</em> file first in the Gatherer's
lib directory and then in the default location.  Both of these can
be overridden with the <bf>-t</bf> option to <tt>SGML.sum</tt>.
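
As a sketch, both options might be combined on the command line as shown
below.  The directory and the input file name are placeholders, and the
example assumes that <bf>-d</bf> takes a pathname argument in the same way
as <bf>-t</bf> and that <tt>SGML.sum</tt> is in your search path:

<tscreen><verb>
        % SGML.sum -d /some/dir/HTML.decl -t /some/dir/HTML.sum.tbl HTML index.html
</verb></tscreen>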

<sect3>The SGML to SOIF table

<p>
The translation table provides a simple yet powerful way to specify how
an SGML document is to be summarized.  There are four ways to map SGML
data into SOIF.  The first two are concerned with placing the <em>content</em>
of an SGML tag into a SOIF attribute.

A simple SGML-to-SOIF mapping looks like this:

<tscreen><verb>
        &lt;TAG&gt;              soif1,soif2,...
</verb></tscreen>

This places the content that occurs inside the tag ``TAG'' into the SOIF
attributes ``soif1'' and ``soif2''.  It is possible to select different SOIF
attributes based on SGML attribute values.  For example, if ``ATT'' is
an attribute of ``TAG'', then it would be written like this:

<tscreen><verb>
        &lt;TAG,ATT=x&gt;         x-stuff
        &lt;TAG,ATT=y&gt;         y-stuff
        &lt;TAG&gt;               stuff
</verb></tscreen>

The second two mappings place values of SGML attributes into SOIF
attributes.  To place the value of the ``ATT'' attribute of the ``TAG'' tag
into the ``att-stuff'' SOIF attribute you would write:

<tscreen><verb>
        &lt;TAG:ATT&gt;           att-stuff
</verb></tscreen>

It is also possible to place the value of an SGML attribute
into a SOIF attribute named by the value of a different SGML attribute:

<tscreen><verb>
        &lt;TAG:ATT1&gt;          $ATT2
</verb></tscreen>

When the summarizer encounters an SGML tag not listed in the
table, the content is passed to the parent tag and becomes a part of
the parent's content.  To force the content of some tag <em>not</em> to be
passed up, specify the SOIF attribute as ``ignore''.  To force the
content of some tag to be passed to the parent in addition to being
placed into a SOIF attribute, list an additional SOIF attribute named
``parent''.

Please see Section <ref id="The SGML-based HTML summarizer" name="The
SGML-based HTML summarizer"> for examples of these mappings.
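
Putting these forms together, a translation table for some other SGML
document type might look like the sketch below.  The tag names
(ARTICLE, ABSTRACT, KEYWORD, FOOTNOTE, XREF) and the SOIF attribute
names on the right are invented for illustration and do not correspond
to any DTD shipped with Harvest:

<tscreen><verb>
        &lt;ARTICLE&gt;         body
        &lt;ABSTRACT&gt;        abstract,parent
        &lt;KEYWORD&gt;         keywords,parent
        &lt;FOOTNOTE&gt;        ignore
        &lt;XREF:TARGET&gt;     url-references
</verb></tscreen>

With such a table, the content of ABSTRACT and KEYWORD is stored in its
own SOIF attribute and also remains part of the enclosing ARTICLE
content, the content of FOOTNOTE is dropped, and the value of the TARGET
attribute of each XREF tag is placed into ``url-references''.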

<sect3>Errors and warnings from the SGML Parser

<p>
The <tt>sgmls</tt> parser can generate an overwhelming volume of error and
warning messages.  This will be especially true for HTML documents
found on the Internet, which often do not conform to the strict HTML
DTD.  By default, errors and warnings are redirected to <em>/dev/null</em> so
that they do not clutter the Gatherer's log files.  To enable logging of
these messages, edit the <tt>SGML.sum</tt> Perl script and set
<bf>$syntax_check = 1</bf>.

<sect3>Creating a summarizer for a new SGML-tagged data type

<p>
To create an SGML summarizer for a new SGML-tagged data type with
an associated DTD, you need to do the following:

<enum>
<item>Write a shell script named FOO.sum which simply contains

      <tscreen><verb>
        #!/bin/sh
        exec SGML.sum FOO "$@"
      </verb></tscreen>

<item>Modify the essence configuration files (as described in Section
      <ref id="Customizing the type recognition step"
      name="Customizing the type recognition step">) so that your
      documents get typed as FOO.
<item>Create the directory
      <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/</em> and copy your
      DTD and Declaration there as FOO.dtd and FOO.decl.  Edit
      <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/catalog</em> and add
      FOO.dtd to it.
<item>Create the translation table FOO.sum.tbl and place it with the
      DTD in <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/</em>.
</enum>

At this point you can test everything from the command line as follows:

<tscreen><verb>
        % FOO.sum myfile.foo
</verb></tscreen>
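
By analogy with the HTML entry shown earlier, the catalog line added in
the third step would presumably look like the following.  This is a
sketch that assumes the DTD declares its document type as ``FOO'':

<tscreen><verb>
        DOCTYPE   FOO   FOO/FOO.dtd
</verb></tscreen>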

<sect3>The SGML-based HTML summarizer
<label id="The SGML-based HTML summarizer">

<p>
Harvest can summarize HTML using the generic SGML summarizer described in
Section <ref id="Summarizing SGML data" name="Summarizing SGML data">.
The advantage of this approach is that the summarizer is more easily
customizable, and fits with the well-conceived SGML model (where you
define DTDs for individual document types and build interpretation
software to understand DTDs rather than individual document types).
The downside is that the summarizer is now pickier about syntax, and
many Web documents are not syntactically correct.  Because of this
pickiness, the default is for the HTML summarizer to run with syntax
checking outputs disabled.  If your documents are so badly formed that
they confuse the parser, this may mean the summarizing process
dies unceremoniously.  If you find that some of your HTML documents do not
get summarized or only get summarized in part, you can turn syntax-checking
output on by setting <bf>$syntax_check = 1</bf> in
<tt>$HARVEST_HOME/lib/gatherer/SGML.sum</tt>.  That will allow you to see
which documents are invalid and where.

Note that part of the reason for this problem is that Web browsers do not
insist on well-formed documents.  So, users can easily create documents
that are not completely valid, yet display fine.

Below is the default SGML-to-SOIF table used by the HTML summarizer:

<tscreen><verb>
HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    &lt;A&gt;             keywords,parent
    &lt;A:HREF&gt;        url-references
    &lt;ADDRESS&gt;       address
    &lt;B&gt;             keywords,parent
    &lt;BODY&gt;          body
    &lt;CITE&gt;          references
    &lt;CODE&gt;          ignore
    &lt;EM&gt;            keywords,parent
    &lt;H1&gt;            headings
    &lt;H2&gt;            headings
    &lt;H3&gt;            headings
    &lt;H4&gt;            headings
    &lt;H5&gt;            headings
    &lt;H6&gt;            headings
    &lt;HEAD&gt;          head
    &lt;I&gt;             keywords,parent
    &lt;IMG:SRC&gt;       images
    &lt;META:CONTENT&gt;  $NAME
    &lt;STRONG&gt;        keywords,parent
    &lt;TITLE&gt;         title
    &lt;TT&gt;            keywords,parent
    &lt;UL&gt;            keywords,parent
</verb></tscreen>

The pathname to this file is
<em>$HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl</em>.

Individual Gatherers may do customized HTML summarizing by placing a
modified version of this file in the Gatherer <em>lib</em> directory.
Another way to customize is to modify the <tt>HTML.sum</tt> script and
add a <bf>-t</bf> option to the SGML.sum command.  For example:

<tscreen><verb>
        SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML "$@"
</verb></tscreen>

In HTML, the document title is written as:

<tscreen><verb>
        &lt;TITLE&gt;My Home Page&lt;/TITLE&gt;
</verb></tscreen>

The above translation table will place this in the SOIF summary as:

<tscreen><verb>
        title{13}:  My Home Page
</verb></tscreen>

Note that ``keywords,parent'' occurs frequently in the table.  For any
specially marked text (bold, emphasized, hypertext links, etc.), the
words will be copied into the keywords attribute and also left in the
content of the parent element.  This keeps the body of the text readable
by not removing certain words.

Any text that appears inside a pair of CODE tags will not show up
in the summary because we specified ``ignore'' as the SOIF attribute.

URLs in HTML anchors are written as:

<tscreen><verb>
        &lt;A HREF=&quot;http://harvest.cs.colorado.edu/&quot;&gt;
</verb></tscreen>

The specification for <bf>&lt;A:HREF&gt;</bf> in the above translation
table causes this to appear as:

<tscreen><verb>
        url-references{32}: http://harvest.cs.colorado.edu/
</verb></tscreen>

<sect3>Adding META data to your HTML

<p>
One of the most useful HTML tags is META.  This allows the document writer
to include arbitrary metadata in an HTML document.  A typical use of
the META element is:

<tscreen><verb>
        &lt;META NAME=&quot;author&quot; CONTENT=&quot;Joe T. Slacker&quot;&gt;
</verb></tscreen>

By specifying ``<bf>&lt;META:CONTENT&gt;</bf> $NAME'' in the
translation table, this comes out as:

<tscreen><verb>
        author{15}: Joe T. Slacker
</verb></tscreen>

Using the META tags, HTML authors can easily add a list of keywords to
their documents:

<tscreen><verb>
        &lt;META NAME=&quot;keywords&quot; CONTENT=&quot;word1 word2&quot;&gt;
        &lt;META NAME=&quot;keywords&quot; CONTENT=&quot;word3 word4&quot;&gt;
</verb></tscreen>

<sect3>Other examples

<p>
A very terse HTML summarizer could be specified with a table that only
puts emphasized words into the keywords attribute:

<tscreen><verb>
HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    &lt;A&gt;             keywords
    &lt;B&gt;             keywords
    &lt;EM&gt;            keywords
    &lt;H1&gt;            keywords
    &lt;H2&gt;            keywords
    &lt;H3&gt;            keywords
    &lt;I&gt;             keywords
    &lt;META:CONTENT&gt;  $NAME
    &lt;STRONG&gt;        keywords
    &lt;TITLE&gt;         title,keywords
    &lt;TT&gt;            keywords
</verb></tscreen>

Conversely, a full-text summarizer can be easily specified with only:

<tscreen><verb>
HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    &lt;HTML&gt;          full-text
    &lt;TITLE&gt;         title,parent
</verb></tscreen>

<sect2>Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps
<label id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">

<p>
The Harvest Gatherer's actions are defined by a set of configuration and
utility files, and a corresponding set of executable programs referenced by
some of the configuration files.

If you want to customize a Gatherer, you should create <em>bin</em> and
<em>lib</em> subdirectories in the directory where you are running the
Gatherer, and then copy <em>$HARVEST_HOME/lib/gatherer/*.cf</em> and
<em>$HARVEST_HOME/lib/gatherer/magic</em> into your <em>lib</em> directory.
Then add to your Gatherer configuration file:

<tscreen><verb>
        Lib-Directory:         lib
</verb></tscreen>
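
The directory setup described above amounts to roughly the following
commands, run from the directory in which the Gatherer runs.  This is a
sketch that assumes the <em>HARVEST_HOME</em> environment variable is set:

<tscreen><verb>
        % mkdir bin lib
        % cp $HARVEST_HOME/lib/gatherer/*.cf lib/
        % cp $HARVEST_HOME/lib/gatherer/magic lib/
</verb></tscreen>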

The details about what each of these files does are described below.
The basic contents of a typical Gatherer's directory are as follows
(note: some of the file names below can be changed by setting
variables in the Gatherer configuration file, as described in
Section <ref id="Setting variables in the Gatherer configuration file"
name="Setting variables in the Gatherer configuration file">):

<tscreen><verb>
        RunGatherd*    bin/           GathName.cf    log.errors     tmp/
        RunGatherer*   data/          lib/           log.gatherer

        bin:
        MyNewType.sum*

        data:
        All-Templates.gz    INFO.soif    PRODUCTION.gdbm    gatherd.log
        INDEX.gdbm          MD5.gdbm     gatherd.cf

        lib:
        bycontent.cf   byurl.cf       quick-sum.cf
        byname.cf      magic          stoplist.cf

        tmp:
</verb></tscreen>

The <tt>RunGatherd</tt> and <tt>RunGatherer</tt> scripts are used to export the
Gatherer's database after a machine reboot and to run the Gatherer,
respectively.  The <em>log.errors</em> and <em>log.gatherer</em> files
contain error messages and the output of the <em>Essence</em> typing step,
respectively (Essence will be described shortly).  The <em>GathName.cf</em>
file is the Gatherer's configuration file.

The <em>bin</em> directory contains any summarizers and any other programs
needed by the summarizers.  If you were to customize the Gatherer by
adding a summarizer, you would place those programs in this
<em>bin</em> directory; the <tt>MyNewType.sum</tt> is an example.

The <em>data</em> directory contains the Gatherer's database which
<tt>gatherd</tt> exports.  The Gatherer's database consists of the
<em>All-Templates.gz, INDEX.gdbm, INFO.soif, MD5.gdbm</em> and
<em>PRODUCTION.gdbm</em> files.  The <em>gatherd.cf</em> file is used
to support access control as described in Section <ref id="Controlling
access to the Gatherer's database" name="Controlling access to the
Gatherer's database">.  The <em>gatherd.log</em> file is where the
<tt>gatherd</tt> program logs its information.

The <em>lib</em> directory contains the configuration files used by the
Gatherer's subsystems, namely Essence.  These files are described briefly
in the following table:

<tscreen><verb>
        bycontent.cf    Content parsing heuristics for type recognition step
        byname.cf       File naming heuristics for type recognition step
        byurl.cf        URL naming heuristics for type recognition step
        magic           UNIX ``file'' command specifications (matched against
                        bycontent.cf strings)
        quick-sum.cf    Extracts attributes for summarizing step.
        stoplist.cf     File types to reject during candidate selection
</verb></tscreen>

<sect3>Customizing the type recognition step
<label id="Customizing the type recognition step">

<p>
Essence recognizes types in three ways (in order of precedence):
by URL naming heuristics, by file naming heuristics, and by locating
<em>identifying</em> data within a file using the UNIX <tt>file</tt> command.

To modify the type recognition step, edit <em>lib/byname.cf</em> to add file
naming heuristics, or <em>lib/byurl.cf</em> to add URL naming heuristics, or
<em>lib/bycontent.cf</em> to add by-content heuristics.  The by-content
heuristics match the output of the UNIX <tt>file</tt> command, so you may
also need to edit the <em>lib/magic</em> file.  See Sections
<ref id="Example 3" name="Example 3"> and <ref id="Example 4"
name="Example 4"> for detailed examples of how to customize the type
recognition step.

<sect3>Customizing the candidate selection step
<label id="Customizing the candidate selection step">

<p>
The <em>lib/stoplist.cf</em> configuration file contains a list of types
that are rejected by Essence.  You can add or delete types from
<em>lib/stoplist.cf</em> to control the candidate selection step.

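To direct Essence to index only certain types, list the types to
index in <em>lib/allowlist.cf</em>, then supply Essence with the
<bf>--allowlist</bf> flag (for example, through the
<bf>Essence-Options</bf> variable described in Section <ref id="Setting
variables in the Gatherer configuration file" name="Setting variables
in the Gatherer configuration file">).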

The file and URL naming heuristics used by the type recognition step
(described in Section <ref id="Customizing the type recognition step"
name="Customizing the type recognition step">) are particularly
useful for candidate selection when gathering remote data.  They allow
the Gatherer to avoid retrieving files that you don't want to index
(in contrast, recognizing types by locating identifying data within a file
requires that the file be retrieved first).  This approach can save quite
a bit of network traffic, particularly when used in combination with
enumerated <em>RootNode</em> URLs.  For example, many sites provide each of
their files in both a compressed and uncompressed form.  By building a
<em>lib/allowlist.cf</em> containing only the Compressed types, you can avoid
retrieving the uncompressed versions of the files.

<sect3>Customizing the presentation unnesting step
<label id="Customizing the presentation unnesting step">

<p>
Some types are declared as ``nested'' types.  Essence treats these
differently than other types, by running a presentation unnesting algorithm
or ``Exploder'' on the data rather than a Summarizer.  At present
Essence can handle files nested in the following formats:

<enum>
<item>binhex
<item>uuencode
<item>shell archive (``shar'')
<item>tape archive (``tar'')
<item>bzip2 compressed (``bzip2'')
<item>compressed
<item>GNU compressed (``gzip'')
<item>zip compressed archive
</enum>

To customize the presentation unnesting step you can modify the Essence
source file <em>src/gatherer/essence/unnest.c</em>.  This file lists the
available presentation encodings, and also specifies the unnesting algorithm.
Typically, an external program is used to unravel a file into one or more
component files (e.g., <tt>bzip2</tt>, <tt>gunzip</tt>, <tt>uudecode</tt>, and <tt>tar</tt>).

An <em>Exploder</em> may also be used to explode a file into a stream of SOIF
objects.  An Exploder program takes a URL as its first command-line argument
and a file containing the data to use as its second, and then generates one
or more SOIF objects as output.  For your convenience, the <em>Exploder</em>
type is already defined as a nested type.  To save some time, you can use
this type and its corresponding <tt>Exploder.unnest</tt> program rather than
modifying the Essence code.

See Section <ref id="Example 2" name="Example 2"> for a detailed
example on writing an Exploder.  The <em>unnest.c</em> file also
contains further information on defining the unnesting algorithms.

<sect3>Customizing the summarizing step
<label id="Customizing the summarizing step">

<p>
Essence supports two mechanisms for defining the type-specific extraction
algorithms (called <em>Summarizers</em>) that generate content summaries:
a UNIX program that takes as its only command line argument the filename of
the data to summarize, and line-based regular expressions specified in
<em>lib/quick-sum.cf</em>.  See Section <ref id="Example 4"
name="Example 4"> for detailed examples on how to define both types of
Summarizers.

The UNIX Summarizers are named using the convention <tt>TypeName.sum</tt>
(e.g., <tt>PostScript.sum</tt>).  These Summarizers output their content
summary in a SOIF attribute-value list (see Section
<ref id="The Summary Object Interchange Format (SOIF)" name="The
Summary Object Interchange Format (SOIF)">).  You can use the
<tt>wrapit</tt> command to wrap raw output into the SOIF format (i.e.,
to provide byte-count delimiters on the individual attribute-value pairs).

There is a summarizer called <tt>FullText.sum</tt> that you can use to
perform full text indexing of selected file types, by simply setting up
the <em>lib/bycontent.cf</em> and <em>lib/byname.cf</em> configuration files
to recognize the desired file types as FullText (i.e., using ``FullText''
in column 1 next to the matching regular expression).

<sect1>Post-Summarizing: Rule-based tuning of object summaries

<p>
It is possible to ``fine-tune'' the summary information generated by the
Essence summarizers.  A typical application is to change the
<em>Time-to-Live</em> attribute based on some knowledge about the objects.
For example, an administrator could use the post-summarizing feature to give
quickly-changing objects a lower TTL, and very stable documents a higher TTL.

Objects are selected for post-summarizing if they meet a specified condition.
A condition consists of three parts: an attribute name, an operation, and
some string data.  For example:

<tscreen><verb>
        city == 'New York'
</verb></tscreen>

In this case we are checking if the <em>city</em> attribute is equal to
the string `New York'.  For exact string matching, the string data must
be enclosed in single quotes.  Regular expressions are also supported:

<tscreen><verb>
        city ~ /New York/
</verb></tscreen>

Negative operators are also supported:

<tscreen><verb>
        city != 'New York'
        city !~ /New York/
</verb></tscreen>

Conditions can be joined with `<bf>&amp;&amp;</bf>' (logical and) or
`<bf>||</bf>' (logical or) operators:

<tscreen><verb>
        city == 'New York' &amp;&amp; state != 'NY';
</verb></tscreen>

When all conditions are met for an object, some number of instructions are
executed on it.  There are four types of instructions which can be specified:

<enum>
<item>Set an attribute exactly to some specific string.

      Example:

      <tscreen><verb>
        time-to-live = &quot;86400&quot;
      </verb></tscreen>

<item>Filter an attribute through some program.  The attribute value is given
      as input to the filter.  The output of the filter becomes the new attribute
      value.

      Example:

      <tscreen><verb>
        keywords | tr A-Z a-z
      </verb></tscreen>

<item>Filter multiple attributes through some program.  In this case the filter
      must read and write attributes in the SOIF format.  Note that this
      instruction is written with `<bf>!</bf>' rather than `<bf>|</bf>'.

      Example:

      <tscreen><verb>
        address,city,state,zip ! cleanup-address.pl
      </verb></tscreen>

<item>A special case instruction is to delete an object.  To do this, simply
      write:

      <tscreen><verb>
        delete()
      </verb></tscreen>

</enum>

<sect2>The Rules file

<p>
The conditions and instructions are combined together in a ``rules'' file.
The format of this file is somewhat similar to a Makefile; conditions begin
in the first column and instructions are indented by a tab-stop.

Example:

<tscreen><verb>
        type == 'HTML'
                partial-text | cleanup-html-text.pl

        URL ~ /users/
                time-to-live = &quot;86400&quot;
                partial-text ! extract-owner.sh

        type == 'SOIFStream'
                delete()
</verb></tscreen>

This rules file is specified in the <em>gatherer.cf</em> file with the
<bf>Post-Summarizing</bf> tag, e.g.:

<tscreen><verb>
        Post-Summarizing: lib/myrules
</verb></tscreen>
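
As a further illustration, the Time-to-Live tuning described at the start
of this section might be written with rules such as the following.  This is
only a sketch: the URL patterns and TTL values are hypothetical, and the
<em>PostScript</em> type name is assumed from the
<tt>PostScript.sum</tt> summarizer naming convention.

<tscreen><verb>
        URL ~ /news/ || URL ~ /current/
                time-to-live = &quot;3600&quot;

        type == 'PostScript' &amp;&amp; URL !~ /drafts/
                time-to-live = &quot;2592000&quot;
</verb></tscreen>

The first rule gives objects whose URL suggests frequent change a TTL of
one hour; the second gives PostScript documents outside a drafts area a
TTL of 30 days.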

<sect2>Rewriting URLs

<p>
Until version 1.4 it was not possible to rewrite the URL part of an object
summary.  It is now possible, but only by using the ``pipe'' instruction.
This is useful if you want to run a Gatherer on <em>file://</em> URLs but
have the objects appear with <em>http://</em> URLs.  This can be done with
a post-summarizing rule such as:

<tscreen><verb>
        url ~ 'file://localhost/web/htdocs/'
                url | fix-url.pl
</verb></tscreen>

The <tt>fix-url.pl</tt> script might look like:

<tscreen><verb>
        #!/usr/local/bin/perl -p
        s'file://localhost/web/htdocs/'http://www.my.domain/';
</verb></tscreen>
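
Because a ``pipe'' filter reads the attribute value on its standard input
and writes the new value to its standard output, a filter can be checked
from the command line before it is added to a rules file.  For example
(the <em>index.html</em> path is hypothetical, and <tt>perl</tt> is
assumed to be in your path):

<tscreen><verb>
        % echo 'file://localhost/web/htdocs/index.html' | perl fix-url.pl
        http://www.my.domain/index.html
</verb></tscreen>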

<sect1>Gatherer administration

<p>

<sect2>Setting variables in the Gatherer configuration file
<label id="Setting variables in the Gatherer configuration file">

<p>
In addition to customizing the steps described in Section <ref
id="Customizing the type recognition, candidate selection,
presentation unnesting, and summarizing steps" name="Customizing the
type recognition, candidate selection, presentation unnesting, and
summarizing steps">, you can customize the Gatherer by setting
variables in the Gatherer configuration file.  This file consists of two
parts: a list of variables that specify information about the Gatherer
(such as its name, host, and port number), and two lists of URLs (divided
into <bf>RootNodes</bf> and <bf>LeafNodes</bf>) from which to collect
indexing information.  Section <ref id="Basic setup" name="Basic
setup"> shows an example Gatherer configuration file.  In this section
we focus on the variables that the user can set in the first part of
the Gatherer configuration file.

Each variable name starts in the first column, ends with a colon, then is
followed by the value.  The following table shows the supported variables:

<tscreen><verb>
        Access-Delay:           Default delay between URL accesses.
        Data-Directory:         Directory where GDBM database is written.
        Debug-Options:          Debugging options passed to child programs.
        Errorlog-File:          File for logging errors.
        Essence-Options:        Any extra options to pass to Essence.
        FTP-Auth:               Username/password for protected FTP documents.
        Gatherd-Inetd:          Denotes that gatherd is run from inetd.
        Gatherer-Host:          Full hostname where the Gatherer is run.
        Gatherer-Name:          A unique name for the Gatherer.
        Gatherer-Options:       Extra options for the Gatherer.
        Gatherer-Port:          Port number for gatherd.
        Gatherer-Version:       Version string for the Gatherer.
        HTTP-Basic-Auth:        Username/password for protected HTTP documents.
        HTTP-Proxy:             host:port of your HTTP proxy.
        Keep-Cache:             ``yes'' to not remove local disk cache.
        Lib-Directory:          Directory where configuration files live.
        Local-Mapping:          Mapping information for local gathering.
        Log-File:               File for logging progress.
        Post-Summarizing:       A rules-file for post-summarizing.
        Refresh-Rate:           Object refresh-rate in seconds, default 1 week.
        Time-To-Live:           Object time-to-live in seconds, default 1 month.
        Top-Directory:          Top-level directory for the Gatherer.
        Working-Directory:      Directory for tmp files and local disk cache.
</verb></tscreen>

Notes:

<itemize>
<item>We recommend that you use the <bf>Top-Directory</bf> variable,
      since it will set the <bf>Data-Directory</bf>,
      <bf>Lib-Directory</bf>, and <bf>Working-Directory</bf> variables.
<item>Both <bf>Working-Directory</bf> and <bf>Data-Directory</bf> will
      have files in them after the Gatherer has run.  The
      <bf>Working-Directory</bf> will hold the local-disk cache that
      the Gatherer uses to reduce network I/O, and the
      <bf>Data-Directory</bf> will hold the GDBM databases that
      contain the content summaries.
<item>You should use full rather than relative pathnames.
<item>All variable definitions <em>must</em> come before the RootNode
      or LeafNode URLs.
<item>Any line that starts with a ``#'' is a comment.
<item><bf>Local-Mapping</bf> is discussed in Section <ref
      id="Local file system gathering for reduced CPU load"
      name="Local file system gathering for reduced CPU load">.
<item><bf>HTTP-Proxy</bf> will retrieve HTTP URLs via a proxy host.
      The syntax is <bf>hostname:port</bf>; for example,
      <bf>proxy.yoursite.com:3128</bf>.
<item><bf>Essence-Options</bf> is particularly useful, as it lets you
      customize basic aspects of the Gatherer easily.
<item>The only valid <bf>Gatherer-Options</bf> value is
      <bf>--save-space</bf>, which directs the Gatherer to be more
      space efficient when preparing its database for export.
<item>The <tt>Gatherer</tt> program accepts the
      <bf>-background</bf> flag, which causes the Gatherer to
      run in the background.
</itemize>

The Essence options are:

<tscreen><verb>
Option                  Meaning
--------------------------------------------------------------------
--allowlist filename    File with list of types to allow
--fake-md5s             Generates MD5s for SOIF objects from a .unnest program
--fast-summarizing      Trade some consistency for speed.  Use only when
                        an external summarizer is known to generate clean,
                        unique attributes.
--full-text             Use entire file instead of summarizing.  Alternatively,
                        you can perform full text indexing of individual file
                        types by using the FullText.sum summarizer.
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--stoplist filename     File with list of types to remove
--type-only             Only type data; do not summarize objects
</verb></tscreen>
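
As a sketch of how these variables and options fit together, the first part
of a Gatherer configuration file might look like the following.  All values
shown (names, paths, port and proxy) are hypothetical and must be adapted
to your site; in the actual file each variable name starts in the first
column.

<tscreen><verb>
        #  Hypothetical example values; adapt to your site.
        Gatherer-Name:    Example Gatherer
        Gatherer-Host:    harvest.yoursite.com
        Gatherer-Port:    8500
        Top-Directory:    /usr/local/harvest/gatherers/example
        HTTP-Proxy:       proxy.yoursite.com:3128
        Keep-Cache:       yes
        Gatherer-Options: --save-space
        Essence-Options:  --no-keywords --allowlist /usr/local/harvest/gatherers/example/lib/allowlist.cf
</verb></tscreen>

The RootNode and LeafNode URLs would follow these definitions.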

A particular note about full text summarizing: Using the Essence
<bf>--full-text</bf> option causes files not to be passed through the
Essence content extraction mechanism.  Instead, their entire content is
included in the SOIF summary stream.  In some cases this may produce
unwanted results (e.g., it will directly include the PostScript for a
document rather than first passing the data through a PostScript to text
extractor, providing few searchable terms and large SOIF objects).  Using
the individual file type summarizing mechanism described in Section
<ref id="Customizing the summarizing step" name="Customizing the
summarizing step"> will work better in this regard, but will require
you to specify how data are extracted for each individual file
type.  In a future version of Harvest we will change the Essence
<bf>--full-text</bf> option to perform content extraction before
including the full text of documents.

<sect2>Local file system gathering for reduced CPU load
<label id="Local file system gathering for reduced CPU load">

<p>
Although the Gatherer's work load is specified using URLs, often the
files being gathered are located on a local file system.  In this case
it is much more efficient to gather directly from the local file system
than via FTP/Gopher/HTTP/News, primarily because of all the UNIX
forking required to gather information via these network processes.  For
example, our measurements indicate that gathering via FTP causes 4 to 7
times more CPU load than gathering directly from the local file system.
For large
collections (e.g., archive sites containing many thousands of files),
the CPU savings can be considerable.

Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to
translate URLs to local file system names, using the <bf>Local-Mapping</bf>
Gatherer configuration file variable (see Section <ref id="Setting
variables in the Gatherer configuration file" name="Setting variables
in the Gatherer configuration file">).  The syntax is:

<tscreen><verb>
        Local-Mapping: URL_prefix local_path_prefix
</verb></tscreen>

This causes all URLs starting with <bf>URL_prefix</bf> to be translated to
files starting with the prefix <bf>local_path_prefix</bf> while gathering,
but to be left as URLs in the results of queries (so the objects can be
retrieved as usual).  Note that no regular expressions are supported here.
As an example, the specification

<tscreen><verb>
        Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/
        Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/
</verb></tscreen>

would cause the URL
<em>http://harvest.cs.colorado.edu/&#126;hardy/Home.html</em> to be
translated to the local file name
<em>/homes/hardy/public_html/Home.html</em>, while the URL
<em>ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z</em>
would be translated to the local file name
<em>/cs/ftp/techreports/schwartz/Harvest.Conf.ps.Z</em>.

Local gathering will work over NFS file systems.  A local mapping will
fail if: the local file cannot be opened for reading; or the local
file is not a regular file; or the local file has execute bits set.
So, for directories, symbolic links and CGI scripts, the server is
always contacted rather than the local file system.  Lastly, the
Gatherer does not perform any URL syntax translations for local
mappings.  If your URL has characters that should be escaped (as in
<htmlurl url="http://www.ietf.org/rfc/rfc1738.txt" name="RFC1738">), then the
local mapping will fail.  Starting with version 1.4 patchlevel 2
Essence will print <em>[L]</em> after URLs which were successfully
accessed locally.

Note that if your network is highly congested, it may actually be faster
to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very
inefficient in highly congested situations.  Even better would be to run
local Gatherers on the hosts where the disks reside, and access them
directly via the local file system.

<sect2>Gathering from password-protected servers

<p>
You can gather password-protected documents from HTTP and FTP servers.
In both cases, you can specify a username and password as a part of the
URL.  The format is as follows:

<tscreen><verb>
         ftp://user:password@host:port/url-path
        http://user:password@host:port/url-path
</verb></tscreen>

With this format, the ``user:password'' part is kept as a part
of the URL string throughout Harvest.  This may enable
anyone who uses your Broker(s) to access password-protected
documents.

You can keep the username and password information ``hidden'' by
specifying the authentication information in the Gatherer configuration
file.  For HTTP, the format is as follows:

<tscreen><verb>
        HTTP-Basic-Auth: realm username password
</verb></tscreen>

where <bf>realm</bf> is the same as the <bf>AuthName</bf> parameter given
in an Apache httpd <em>httpd.conf</em> or <em>.htaccess</em> file.  In other
httpd server configurations, the realm value is sometimes called
<bf>ServerId</bf>.

For FTP, the format in the <em>gatherer.cf</em> file is:

<tscreen><verb>
        FTP-Auth: hostname[:port] username password
</verb></tscreen>
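
For example, with a hypothetical realm, FTP host, user name and password,
the corresponding lines in the Gatherer configuration file might read as
follows (the realm here is assumed to contain no white space):

<tscreen><verb>
        HTTP-Basic-Auth: StaffOnly harvest secret
        FTP-Auth: ftp.yoursite.com:21 harvest secret
</verb></tscreen>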

<sect2>Controlling access to the Gatherer's database
<label id="Controlling access to the Gatherer's database">

<p>
You can use the <em>gatherd.cf</em> file (placed in the
<bf>Data-Directory</bf> of a Gatherer) to control access to the Gatherer's
database.  A line that begins with <bf>Allow</bf> is followed by any number
of domain or host names that are allowed to connect to the Gatherer.  If the
word <bf>all</bf> is used, then all hosts are matched.  <bf>Deny</bf> is the
opposite of <bf>Allow</bf>.  The following example allows only hosts in
the <bf>cs.colorado.edu</bf> or <bf>usc.edu</bf> domains to access the
Gatherer's database:

<tscreen><verb>
        Allow  cs.colorado.edu usc.edu
        Deny   all
</verb></tscreen>

<sect2>Periodic gathering and realtime updates
<label id="Periodic gathering and realtime updates">

<p>
The <tt>Gatherer</tt> program does not automatically do any periodic
updates -- when you run it, it processes the specified URLs, starts up
a <tt>gatherd</tt> daemon (if one isn't already running), and then exits.
If you want to update the data periodically (e.g., to capture new files as
they are added to an FTP archive), you need to use the UNIX <tt>cron</tt>
command to run the <tt>Gatherer</tt> program at some regular interval.

To set up periodic gathering via <tt>cron</tt>, use the <tt>RunGatherer</tt>
command that <tt>RunHarvest</tt> will create.  An example <tt>RunGatherer</tt>
script follows:

<tscreen><verb>
        #!/bin/sh
        #
        #  RunGatherer - Runs the ATT 800 Gatherer (from cron)
        #
        HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
        export PATH
        NNTPSERVER=localhost; export NNTPSERVER
        cd /usr/local/harvest/gatherers/att800
        exec Gatherer "att800.cf"
</verb></tscreen>
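
A <tt>crontab</tt> entry along the following lines would then run the
Gatherer once a week.  This is a sketch: the schedule (03:00 every Sunday)
and the location of the <tt>RunGatherer</tt> script are assumptions.

<tscreen><verb>
        # minute hour day-of-month month day-of-week  command
        0 3 * * 0  /usr/local/harvest/gatherers/att800/RunGatherer
</verb></tscreen>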

You should run the <tt>RunGatherd</tt> command from your system startup
file (e.g., <em>/etc/rc.local</em>), so the Gatherer's database is exported
each time the machine reboots.  An example <tt>RunGatherd</tt> script
follows:

<tscreen><verb>
        #!/bin/sh
        #
        #  RunGatherd - starts up the gatherd process (from /etc/rc.local)
        #
        HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        PATH=${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/bin:$PATH; export PATH
        exec gatherd -d /usr/local/harvest/gatherers/att800/data 8500
</verb></tscreen>

<sect2>The local disk cache
<label id="The local disk cache">

<p>
The Gatherer maintains a local disk cache of files it gathers to reduce
network traffic from restarting aborted gathering attempts.  However, since
the remote server must still be contacted whenever <tt>Gatherer</tt> runs,
please do not set your cron job to run <tt>Gatherer</tt> frequently.  A
typical interval might be weekly or monthly, depending on how congested the
network is and how important it is to have the most current data.

By default, the Gatherer's local disk cache is deleted after each
successful completion.  To save the local disk cache between Gatherer
sessions, define <bf>Keep-Cache: yes</bf> in your Gatherer configuration
file (Section <ref id="Setting variables in the Gatherer configuration
file" name="Setting variables in the Gatherer configuration file">).

If you want your Broker's index to reflect new data, then you
must run the Gatherer <em>and</em> run a Broker collection.  By default,
a Broker will perform collections once a day.  If you want the Broker
to collect data as soon as it's gathered, then you will need to
coordinate the timing of the completion of the Gatherer and the Broker
collections.

If you run your Gatherer frequently and you set <bf>Keep-Cache: yes</bf>
in your Gatherer configuration file, then the Gatherer's local disk cache
may interfere with retrieving updates.  By default, objects in the
local disk cache expire after 7 days; however, you can expire objects
more quickly by setting the <bf>$GATHERER_CACHE_TTL</bf> environment
variable to the number of seconds for the Time-To-Live (TTL) before you
run the Gatherer, or you can change <tt>RunGatherer</tt> to remove the
Gatherer's <em>tmp</em> directory after each Gatherer run.  For example, to
expire objects in the local disk cache after one day:

<tscreen><verb>
        % setenv GATHERER_CACHE_TTL 86400       # one day
        % ./RunGatherer
</verb></tscreen>

The Gatherer's local disk cache size defaults to 32 MB,
but you can change this value by setting the <bf>$HARVEST_MAX_LOCAL_CACHE</bf>
environment variable to the size in MB before you run the Gatherer.  For
example, to have a maximum cache of 10 MB you can do as follows:

<tscreen><verb>
        % setenv HARVEST_MAX_LOCAL_CACHE 10       # 10 MB
        % ./RunGatherer
</verb></tscreen>
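
The examples above use <tt>csh</tt> syntax.  In a Bourne shell script such
as <tt>RunGatherer</tt>, the same settings could be made by adding lines
like these before the <tt>Gatherer</tt> program is started (a sketch using
the values from the examples above):

<tscreen><verb>
        GATHERER_CACHE_TTL=86400; export GATHERER_CACHE_TTL             # one day
        HARVEST_MAX_LOCAL_CACHE=10; export HARVEST_MAX_LOCAL_CACHE      # 10 MB
</verb></tscreen>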

If you have access to the software that creates the files that you are
indexing (e.g., if all updates are funneled through a particular editor,
update script, or system call), you can modify this software to schedule
realtime Gatherer updates whenever a file is created or updated.  For
example, if all users update the files being indexed using a particular
program, this program could be modified to run the Gatherer upon
completion of the user's update.

Note that, when used in conjunction with <tt>cron</tt>, the Gatherer
provides a powerful data ``mirroring'' facility.  You can use the Gatherer
to replicate the contents of one or more sites, retrieve data in multiple
formats via multiple protocols (FTP, HTTP, etc.), optionally perform a
variety of type- or site-specific transformations on the data, and serve
the results very efficiently as compressed SOIF object summary streams to
other sites that wish to use the data for building indexes or for other
purposes.

<sect2>Incorporating manually generated information into a Gatherer
<label id="Incorporating manually generated information into a Gatherer">

<p>
You may want to inspect the quality of the automatically-generated SOIF
templates.  In general, Essence's techniques for automatic information
extraction produce imperfect results.  Sometimes it is possible to
customize the summarizers to better suit the particular context (see
Section <ref id="Customizing the summarizing step" name="Customizing
the summarizing step">).  Sometimes, however, it makes sense to
augment or change the automatically generated keywords with manually
entered information.  For example, you may want to add <em>Title</em>
attributes to the content summaries for a set of PostScript documents
(since it's difficult to parse them out of PostScript automatically).

Harvest provides some programs that automatically clean up a Gatherer's
database.  The <tt>rmbinary</tt> program removes any binary data from the
templates.  The <tt>cleandb</tt> program does some simple validation of SOIF
objects, and when given the <bf>-truncate</bf> flag it will truncate
the <em>Keywords</em> data field to 8 kilobytes.  To help in manually
managing the Gatherer's databases, the <tt>gdbmutil</tt> GDBM database
management tool is provided in <em>$HARVEST_HOME/lib/gatherer</em>.

In a future release of Harvest we will provide a forms-based mechanism to
make it easy to provide manual annotations.  In the meantime, you can
annotate the Gatherer's database with manually generated information by
using the <tt>mktemplate</tt>, <tt>template2db</tt>, <tt>mergedb</tt>, and
<tt>mkindex</tt> programs.  You first need to create a file (called, say,
<em>annotations</em>) in the following format:

<tscreen><verb>
        @FILE { url1
        Attribute-Name-1:        DATA
        Attribute-Name-2:        DATA
        ...
        Attribute-Name-n:        DATA
        }

        @FILE { url2
        Attribute-Name-1:        DATA
        Attribute-Name-2:        DATA
        ...
        Attribute-Name-n:        DATA
        }

        ...
</verb></tscreen>

Note that the <em>Attributes</em> must begin in column 0 and have one tab
after the colon, and the <em>DATA</em> must be on a single line.

Next, run the <tt>mktemplate</tt> and <tt>template2db</tt> programs to
generate SOIF and then GDBM versions of these data (you can have several
files containing the annotations, and generate a single GDBM database from
them):

<tscreen><verb>
        % set path = ($HARVEST_HOME/lib/gatherer $path)
        % mktemplate annotations [annotations2 ...] | template2db annotations.gdbm
</verb></tscreen>

Finally, you run <tt>mergedb</tt> to incorporate the annotations into the
automatically generated data, and <tt>mkindex</tt> to generate an index for
it.  The usage line for <tt>mergedb</tt> is:

<tscreen><verb>
        mergedb production automatic manual [manual ...]
</verb></tscreen>

The idea is that <em>production</em> is the final GDBM database that the
Gatherer will serve.  This is a <em>new</em> database that will be generated
from the other databases on the command line.  <em>automatic</em> is the GDBM
database that a Gatherer automatically generated in a previous run (e.g.,
<em>WORKING.gdbm</em> or a previous <em>PRODUCTION.gdbm</em>).  <em>manual</em>
and so on are the GDBM databases that you manually created.  When <tt>mergedb</tt> runs,
it builds the <em>production</em> database by first copying the templates from
the <em>manual</em> databases, and then merging in the attributes from the
<em>automatic</em> database.  In case of a conflict (the same attribute with
different values in the <em>manual</em> and <em>automatic</em> databases), the
<em>manual</em> values override the <em>automatic</em> values.

By keeping the automatically and manually generated data stored separately,
you can avoid losing the manual updates when doing periodic automatic
gathering.  To do this, you will need to set up a script to remerge the
manual annotations with the automatically gathered data after each
gathering.

An example use of <tt>mergedb</tt> is:

<tscreen><verb>
        % mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
        % mv PRODUCTION.new PRODUCTION.gdbm
        % mkindex
</verb></tscreen>

If the manual database looked like this:

<tscreen><verb>
        @FILE { url1
        my-manual-attribute:  this is a neat attribute
        }
</verb></tscreen>

and the automatic database looked like this:

<tscreen><verb>
        @FILE { url1
        keywords:   boulder colorado
        file-size:  1034
        md5:        c3d79dc037efd538ce50464089af2fb6
        }
</verb></tscreen>

then in the end, the production database will look like this:

<tscreen><verb>
        @FILE { url1
        my-manual-attribute:  this is a neat attribute
        keywords:   boulder colorado
        file-size:  1034
        md5:        c3d79dc037efd538ce50464089af2fb6
        }
</verb></tscreen>
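
The remerging script mentioned above could be sketched as a variant of
<tt>RunGatherer</tt>.  This sketch rests on several assumptions: that
<em>annotations.gdbm</em> is kept in the Gatherer's <em>data</em>
directory, that <tt>Gatherer</tt> exits with a zero status on success, and
that the paths match the earlier <tt>RunGatherer</tt> example.

<tscreen><verb>
        #!/bin/sh
        #
        #  RunGatherer - gathers, then remerges the manual annotations (sketch)
        #
        HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
        export PATH
        cd /usr/local/harvest/gatherers/att800 || exit 1
        # No exec here, so that the script continues after gathering.
        Gatherer "att800.cf" || exit 1
        # Assumption: annotations.gdbm lives next to PRODUCTION.gdbm.
        cd data || exit 1
        mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm || exit 1
        mv PRODUCTION.new PRODUCTION.gdbm
        mkindex
</verb></tscreen>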

<sect1>Troubleshooting

<p>
<descrip>
<tag/Debugging/
Extra information from specific programs and library routines can be logged
by setting debugging flags.  A debugging flag has the form
<bf>-Dsection,level</bf>.  <em>Section</em> is an integer
in the range 1-255, and <em>level</em> is an integer in the range 1-9.
Debugging flags can be given on a command line, with the
<bf>Debug-Options:</bf> tag in a gatherer configuration file, or by setting
the environment variable <bf>$HARVEST_DEBUG</bf>.

Examples:

<tscreen><verb>
        Debug-Options: -D68,5 -D44,1
        % httpenum -D20,1 -D21,1 -D42,1 http://harvest.cs.colorado.edu/
        % setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1'
</verb></tscreen>

Debugging sections and levels have been assigned to the following sections
of the code:

<tscreen><verb>
section  20, level 1, 5, 9          Common liburl URL processing
section  21, level 1, 5, 9          Common liburl HTTP routines
section  22, level 1, 5             Common liburl disk cache routines
section  23, level 1                Common liburl FTP routines
section  24, level 1                Common liburl Gopher routines
section  25, level 1                urlget - standalone liburl program.
section  26, level 1                ftpget - standalone liburl program.
section  40, level 1, 5, 9          Gatherer URL enumeration
section  41, level 1                Gatherer enumeration URL verification
section  42, level 1, 5, 9          Gatherer enumeration for HTTP
section  43, level 1, 5, 9          Gatherer enumeration for Gopher
section  44, level 1, 5             Gatherer enumeration filter routines
section  45, level 1                Gatherer enumeration for FTP
section  46, level 1                Gatherer enumeration for file:// URLs
section  48, level 1, 5             Gatherer enumeration robots.txt stuff
section  60, level 1                Gatherer essence data object processing
section  61, level 1                Gatherer essence database routines
section  62, level 1                Gatherer essence main
section  63, level 1                Gatherer essence type recognition
section  64, level 1                Gatherer essence object summarizing
section  65, level 1                Gatherer essence object unnesting
section  66, level 1, 2, 5          Gatherer essence post-summarizing
section  67, level 1                Gatherer essence object-ID code
section  69, level 1, 5, 9          Common SOIF template processing
section  70, level 1, 5, 9          Broker registry
section  71, level 1                Broker collection routines
section  72, level 1                Broker SOIF parsing routines
section  73, level 1, 5, 9          Broker registry hash tables
section  74, level 1                Broker storage manager routines
section  75, level 1, 5             Broker query manager routines
section  75, level 4                Broker query_list debugging
section  76, level 1                Broker event management routines
section  77, level 1                Broker main
section  78, level 9                Broker select(2) loop
section  79, level 1, 5, 9          Broker gatherer-id management
section  80, level 1                Common utilities memory management
section  81, level 1                Common utilities buffer routines
section  82, level 1                Common utilities system(3) routines
section  83, level 1                Common utilities pathname routines
section  84, level 1                Common utilities hostname processing
section  85, level 1                Common utilities string processing
section  86, level 1                Common utilities DNS host cache
section 101, level 1                Broker PLWeb indexing engine
section 102, level 1, 2, 5          Broker Glimpse indexing engine
section 103, level 1                Broker Swish indexing engine
</verb></tscreen>

<tag/Symptom/
The Gatherer <em>doesn't pick up all the objects</em> pointed to by some of my
RootNodes.

<tag/Solution/
The Gatherer places various limits on enumeration to prevent a
misconfigured Gatherer from abusing servers or running wildly.  See
section <ref id="RootNode specifications" name="RootNode
specifications"> for details on how to override these limits.

<tag/Symptom/
<em>Local-Mapping did not work</em> for me - it retrieved the objects via
the usual remote access protocols.

<tag/Solution/
A local mapping will fail if:

<itemize>
<item>the local filename cannot be opened for reading; or,
<item>the local filename is not a regular file; or,
<item>the local filename has execute bits set.
</itemize>

So for directories, symbolic links, and CGI scripts, the HTTP server is
always contacted.  We don't perform URL translation for local mappings.
If your URLs contain characters that must be escaped, then the local
mapping will also fail.  Add debug option <bf>-D20,1</bf> to see
how local mappings are taking place.
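
The three conditions can be checked by hand before running the Gatherer.
The following Perl fragment is only a sketch of such a check, not the
Gatherer's own code.  The default path is a hypothetical example, and the
use of <tt>lstat</tt> (so that symbolic links are reported as non-regular
files) is an assumption based on the description above.

<tscreen><verb>
        #!/usr/bin/perl
        # Sketch: test a file against the three local mapping conditions.
        # Not taken from the Harvest sources.
        use strict;
        use warnings;

        # Hypothetical default; pass the real filename as first argument.
        my $file = shift || '/usr/local/www/index.html';

        my @st = lstat($file);      # lstat: do not follow symbolic links
        if (!@st) {
            print "$file: cannot be examined ($!)\n";
        } elsif (! -f _) {
            print "$file: not a regular file\n";
        } elsif ($st[2] &amp; 0111) {
            print "$file: execute bits set\n";
        } elsif (! -r $file) {
            print "$file: cannot be opened for reading\n";
        } else {
            print "$file: passes all three checks\n";
        }
</verb></tscreen>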

<tag/Symptom/
Using the <bf>--full-text</bf> option I see a lot of
<em>raw data</em> in the content summaries, with few keywords I can
search.

<tag/Solution/
At present <bf>--full-text</bf> simply includes the full data content
in the SOIF summaries.  Using the individual file type summarizing mechanism
described in Section <ref id="Customizing the summarizing step"
name="Customizing the summarizing step"> will work better in this
regard, but will require you to specify how data are extracted for each
individual file type.  In a future version of Harvest we will change the
Essence <bf>--full-text</bf> option to perform content extraction before
including the full text of documents.

<tag/Symptom/
No indexing terms are being generated in the SOIF summary for the META
tags in my HTML documents.

<tag/Solution/
This probably indicates that your HTML is not syntactically well-formed, and
hence the SGML-based HTML summarizer is not able to recognize it.  See
Section <ref id="Summarizing SGML data" name="Summarizing SGML data">
for details and debugging options.

<tag/Symptom/
Gathered data are <em>not being updated</em>.

<tag/Solution/
The Gatherer does not automatically do periodic updates.  See
Section <ref id="Periodic gathering and realtime updates"
name="Periodic gathering and realtime updates"> for details.

<tag/Symptom/
The Gatherer puts <em>slightly different URLs</em> in the <em>SOIF</em>
summaries than I specified in the Gatherer <em>configuration file</em>.

<tag/Solution/
This happens because the Gatherer attempts to put URLs into a canonical
format.  It does this by removing default port numbers and making similar
cosmetic changes.  Also, by default, Essence (the content extraction
subsystem within the Gatherer) removes the standard stoplist.cf types,
which include HTTP-Query (the cgi-bin stuff).

<tag/Symptom/
There are <em>no Last-Modification-Time</em> or <em>MD5 attributes</em> in my
gathered SOIF data, so the Broker can't do duplicate elimination.

<tag/Solution/
If you gather remote, manually-created information, it is pulled into
Harvest using ``exploders'' that translate from the remote format into
SOIF.  That means they don't have a direct way to fill in the
Last-Modification-Time or MD5 information per record.  Note also
that this will mean one update to the remote records would cause all
records to look updated, which will result in more network load for
Brokers that collect from this Gatherer's data.  As a solution, you can
compute MD5s for all objects, and store them as part of the record.
Then, when you run the exploder you only generate timestamps for the
ones for which the MD5s changed - giving you real last-modification
times.
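
The approach just described can be sketched in Perl as follows.  This is
an illustration only, not part of any Harvest exploder: the subroutine name
and the two hashes are hypothetical, and saving them between runs (for
example in a DBM file) is left out.  <tt>Digest::MD5</tt> is distributed
with current versions of Perl.

<tscreen><verb>
        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # State from the previous run (hypothetical; persistence omitted).
        my (%md5_of, %time_of);

        # Return the MD5 and the last-modification time for one record.
        # The timestamp changes only when the MD5 of the data changes.
        sub stamp_record {
            my ($key, $data, $now) = @_;
            my $md5 = md5_hex($data);
            unless (defined $md5_of{$key} and $md5_of{$key} eq $md5) {
                $md5_of{$key}  = $md5;      # new or changed record
                $time_of{$key} = $now;
            }
            return ($md5, $time_of{$key});
        }

        my ($md5, $time) = stamp_record('record-1', "Title: example\n", time);
        print "MD5 $md5, last modified $time\n";
</verb></tscreen>

The two returned values are what the exploder would store in the MD5 and
Last-Modification-Time attributes of the record.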

<tag/Symptom/
The Gatherer substitutes a ``%7e'' for a ``&#126;'' in all the user
directory URLs.

<tag/Solution/
The Gatherer conforms to <htmlurl url="http://www.ietf.org/rfc/rfc1738.txt"
name="RFC1738">, which says that a tilde inside a URL should be
encoded as ``%7e'', because it is considered an ``unsafe'' character.

<tag/Symptom/
When I search using keywords I know are in a document I have indexed
with Harvest, the <em>document isn't found</em>.

<tag/Solution/
Harvest uses a content extraction subsystem called <em>Essence</em>
that by default does not extract every keyword in a document.  Instead,
it uses heuristics to try to select promising keywords.  You can change
what keywords are selected by customizing the summarizers for that type
of data, as discussed in Section <ref id="Customizing the type
recognition, candidate selection, presentation unnesting, and
summarizing steps" name="Customizing the type recognition, candidate
selection, presentation unnesting, and summarizing steps">.  Or, you can
tell <em>Essence</em> to use full text summarizing if you feel the added
disk space costs are merited, as discussed in Section <ref id="Setting
variables in the Gatherer configuration file" name="Setting variables
in the Gatherer configuration file">.

<tag/Symptom/
I'm running Harvest on HP-UX, but the <tt>essence</tt> process in the
Gatherer <em>takes too much memory</em>.

<tag/Solution/
The supplied regular expression library has memory leaks on HP-UX, so
you need to use the regular expression library supplied with HP-UX.
Change the <em>Makefile</em> in <em>src/gatherer/essence</em> to read:

<tscreen><verb>
        REGEX_DEFINE    = -DUSE_POSIX_REGEX
        REGEX_INCLUDE   =
        REGEX_OBJ       =
        REGEX_TYPE      = posix
</verb></tscreen>

<tag/Symptom/
I built the configuration files to <em>customize</em> how Essence types/content
extracts data, but it <em>uses the standard typing/extracting</em> mechanisms
anyway.

<tag/Solution/
Verify that you have the <bf>Lib-Directory</bf> set to the <em>lib/</em>
directory in which you put your configuration files.
<bf>Lib-Directory</bf> is defined in your Gatherer configuration file.

<tag/Symptom/
I am having problems <em>resolving host names</em> on SunOS.

<tag/Solution/
In order to gather data from hosts outside of your organization, your system
must be able to resolve fully qualified domain names into IP addresses.
If your system cannot resolve hostnames, you will see error messages such
as ``Unknown Host.'' In this case, either:

<itemize>
<item>the hostname you gave does not really exist; or
<item>your system is not configured to use the DNS.
</itemize>

To verify that your system is configured for DNS, make sure that the
file <em>/etc/resolv.conf</em> exists and is readable.  Read the
resolv.conf(5) manual page for information on this file.  You can verify
that DNS is working with the <tt>nslookup</tt> command.

Some sites may use Sun Microsystems' Network Information Service (NIS)
instead of, or in addition to, DNS.  We believe that Harvest works on
systems where NIS has been properly configured.  The NIS servers (the
names of which you can determine from the <tt>ypwhich</tt> command) must be
configured to query DNS servers for hostnames they do not know about.
See the <bf>-b</bf> option of the <tt>ypxfr</tt> command.

<tag/Symptom/
I cannot get the Gatherer to work across our <em>firewall gateway</em>.

<tag/Solution/
Harvest only supports retrieving HTTP objects through a proxy.  It is
not yet possible to request Gopher and FTP objects through a firewall.
For these objects, you may need to run Harvest internally (behind the
firewall) or on the firewall host itself.

If you see the ``Host is unreachable'' message, these are the likely problems:

<itemize>
<item>your connection to the Internet is temporarily down due to a circuit or
      routing failure; or
<item>you are behind a firewall.
</itemize>

If you see the ``Connection refused'' message, the likely problem is
that you are trying to connect to an unused port on the destination
machine.  In other words, there is no program listening for connections
on that port.

The Harvest gatherer is essentially a WWW client.  You should expect it
to work the same as any Web browser.
</descrip>

<sect>The Broker
<label id="The Broker">

<p>

<sect1>Overview

<p>
The Broker retrieves and manages indexing information from Gatherers
and other Brokers, and provides a WWW query interface to the indexing
information.

<sect1>Basic setup

<p>
The Broker is automatically started by the <tt>RunHarvest</tt> command.
Other relevant commands are described in Section <ref id="Starting up
the system: RunHarvest and related commands" name="Starting up the
system: RunHarvest and related commands">.

In the current section we discuss various ways users can customize and
tune the Broker, how to administer the Broker, and the various Broker
programming interfaces.

As suggested in Figure <ref id="img1" name="1">, the Broker uses a
flexible indexing interface that supports a variety of indexing
subsystems.  The default Harvest Broker uses <htmlurl
url="http://webglimpse.org/gdocs.html" name="Glimpse"> as indexer,
but other indexers such as Swish, and WAIS (both <url
url="ftp://ftp.cnidr.org/pub/software/freewais/" name="freeWAIS and
commercial WAIS">), also work with the Broker (see
Section <ref id="Using different index/search engines with the Broker"
name="Using different index/search engines with the Broker">).

To create a new Broker, run the <tt>CreateBroker</tt> program.  It will ask
you a series of questions about how you'd like to configure your Broker, and
then automatically create and configure it.  To start your Broker, use the
<tt>RunBroker</tt> program that <tt>CreateBroker</tt> generates.  The Broker
should be started when your system reboots.  To prevent a collection while
starting the broker, use the <bf>-nocol</bf> option.  There are a number of
ways you can customize or tune the Broker, discussed in Sections <ref
id="Tuning Glimpse indexing in the Broker" name="Tuning Glimpse
indexing in the Broker"> and <ref id="Using different index/search
engines with the Broker" name="Using different index/search engines
with the Broker">.  You may also use the <tt>RunHarvest</tt> command,
discussed in Section <ref id="Starting up the system: RunHarvest and
related commands" name="Starting up the system: RunHarvest and related
commands">, to create both a Broker and a Gatherer.

<sect1>Querying a Broker
<label id="Querying a Broker">

<p>
The Harvest Broker can handle many types of queries.  The queries handled by a
particular Broker depend on what index/search engine is being used inside of
it (e.g., WAIS does not support some of the queries that Glimpse does).  In
this section we describe the full syntax.  If a particular Broker does not
support a certain type of query, it will return an error when the user
requests that type of query.

The simplest query is a single keyword, such as:

<tscreen><verb>
        lightbulb
</verb></tscreen>

Searching for common words (like ``computer'' or ``html'') may take a lot of
time.

Particularly for large Brokers, it is often helpful to use more powerful
queries.  Harvest supports many different index/search engines, with varying
capabilities.  At present, our most powerful (and commonly used) search engine
is <htmlurl url="http://webglimpse.org/gdocs.html" name="Glimpse">,
which supports:

<itemize>
<item>case-insensitive and case-sensitive queries;
<item>matching parts of words, whole words, or multiple word phrases
      (like ``resource discovery'');
<item>Boolean (AND/OR) combinations of keywords;
<item>approximate matches (e.g., allowing spelling errors);
<item>structured queries (which allow you to constrain matches to certain
      attributes);
<item>displaying matched lines or entire matching records (e.g., for citations);
<item>specifying limits on the number of matches returned; and
<item>a limited form of regular expressions (e.g., allowing ``wild card''
      expressions that match all words ending in a particular suffix).
</itemize>

The different types of queries (and how to use them) are discussed below.
Note that you use the same syntax regardless of what index/search engine is
running in a particular Broker, but that not all engines support all of the
above features.  In particular, some of the Brokers use WAIS, which
sometimes searches faster than Glimpse but supports only Boolean keyword
queries and the ability to specify result set limits.

The different options - case-sensitivity, approximate matching, the
ability to show matched lines vs. entire matching records, and the ability
to specify match count limits - can all be specified with buttons and
menus in the Broker query forms.

A structured query has the form:

<tscreen><verb>
        tag-name : value
</verb></tscreen>

where <em>tag-name</em> is a Content Summary attribute name, and <em>value</em>
is the search value within the attribute.  If you click on a Content
Summary, you will see what attributes are available for a particular
Broker.  A list of common attributes is shown in Section
<ref id="List of common SOIF attribute names" name="List of common
SOIF attribute names">.

Keyword searches and structured queries can be combined using Boolean
operators (AND and OR) to form complex queries.  In the absence of
parentheses, the operators are evaluated from left to right.  For multiple
word phrases or regular expressions, you need to enclose the string in
double quotes, e.g.,

<tscreen><verb>
        &quot;internet resource discovery&quot;
</verb></tscreen>

or

<tscreen><verb>
        &quot;discov.*&quot;
</verb></tscreen>

Double quotes should also be used when searching for non-alphanumeric
characters.

<sect2>Example queries

<p>
<descrip>
<tag/Simple keyword search query:/
<em>Arizona</em>

This query returns all objects in the Broker containing the word
<em>Arizona</em>.

<tag/Boolean query:/
<em>Arizona AND desert</em>

This query returns all objects in the Broker that contain both words
anywhere in the object in any order.

<tag/Phrase query:/
<em>&quot;Arizona desert&quot;</em>

This query returns all objects in the Broker that contain <em>Arizona
desert</em> as a phrase.  Notice that you need to put double quotes around the
phrase.

<tag/Boolean queries with phrases:/
<em>&quot;Arizona desert&quot; AND windsurfing</em>

This query returns all objects in the Broker that contain <em>Arizona
desert</em> as a phrase and the word windsurfing.

<tag/Simple Structured query:/
<em>Title : windsurfing</em>

This query returns all objects in the Broker where the <em>Title</em>
attribute contains the value <em>windsurfing</em>.

<tag/Complex query:/
<em>&quot;Arizona desert&quot; AND (Title : windsurfing)</em>

This query returns all objects in the Broker that contain the phrase
<em>Arizona desert</em> and where the <em>Title</em> attribute of the same
object contains the value <em>windsurfing</em>.
</descrip>

<sect2>Regular expressions

<p>
Some types of regular expressions are supported by Glimpse.  A regular
expression search can be much slower than other searches.  The following is
a partial list of possible patterns.  (For more details see the
<htmlurl url="http://webglimpse.org/gdocs.html" name="Glimpse
documentation">.)

<itemize>
<item><em>^joe</em> will match ``joe'' at the beginning of a line.
<item><em>joe$</em> will match ``joe'' at the end of a line.
<item><em>&lsqb;a-ho-z&rsqb;</em> matches any character between ``a'' and ``h'' or
      between ``o'' and ``z''.
<item><em>.</em> matches any single character except newline.
<item><em>c*</em> matches zero or more occurrences of the character ``c''.
<item><em>.*</em> matches any number of characters except newline.
<item><em>\*</em> matches the character ``*''.
      (<em>\</em> escapes any of the above special characters.)
</itemize>

Regular expressions are currently limited to approximately 30 characters,
not including meta characters.  Regular expressions will generally not
cross word boundaries (because only words are stored in the index).  So,
for example, <em>&quot;lin.*ing&quot;</em> will find ``linking'' or
``flinching,'' but not ``linear programming.''
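
The word boundary behaviour can be illustrated with Perl's regular
expressions, which agree with the patterns listed above for this simple
case.  The fragment is only a sketch: it imitates an index of single words
by splitting each text on white space, and it does not use Glimpse.

<tscreen><verb>
        use strict;
        use warnings;

        my $pattern = 'lin.*ing';
        for my $text ('linking', 'flinching', 'linear programming') {
            # Match word by word, as if only words were stored.
            my @hits = grep { /$pattern/ } split ' ', $text;
            print "$text: ", (@hits ? "found (@hits)" : 'not found'), "\n";
        }
</verb></tscreen>

Matching the pattern against the whole string ``linear programming'' would
succeed, which is why the word-by-word result for that text differs.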

<sect2>Query options selected by menus or buttons

<p>
The query page may have the following checkboxes, which allow some control
of the query specification.

<descrip>
<tag/Case insensitive:/
Selecting this checkbox makes the query case insensitive (lower case and
upper case letters are treated alike).  Otherwise, the query is case
sensitive.  The default is case insensitive.

<tag/Keywords match on word boundaries:/
By selecting this checkbox, keywords will match on word boundaries.
Otherwise, a keyword will match part of a word (or phrase).  For example,
&quot;network&quot; will match ``networking'', &quot;sensitive&quot; will
match ``insensitive'', and &quot;Arizona desert&quot; will match ``Arizona
desertness''.  The default is to match keywords on word boundaries.

<tag/Number of errors allowed:/
Glimpse allows the search to contain a number of errors.  An error is
either a deletion, insertion, or substitution of a single character.
The Best Match option will find the match(es) with the least number of
errors.  The default is 0 (zero) errors.
</descrip>

<em>Note:</em> The previous three options do not apply to attribute names.
Attribute names are always case insensitive and allow no errors.
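
To make the notion of an error concrete, the following Perl sketch counts
the minimum number of single-character deletions, insertions, and
substitutions needed to turn one word into another.  It is the textbook
edit distance computation, shown for illustration only; Glimpse uses its
own approximate matching algorithm, and the subroutine name is
hypothetical.

<tscreen><verb>
        use strict;
        use warnings;
        use List::Util qw(min);

        # Minimum number of errors between the strings $s and $t.
        sub errors {
            my ($s, $t) = @_;
            my @prev = (0 .. length($t));   # row for the empty prefix of $s
            for my $i (1 .. length($s)) {
                my @cur = ($i);
                for my $j (1 .. length($t)) {
                    my $cost =
                        substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
                    push @cur, min($prev[$j] + 1,           # deletion
                                   $cur[$j - 1] + 1,        # insertion
                                   $prev[$j - 1] + $cost);  # substitution
                }
                @prev = @cur;
            }
            return $prev[-1];
        }

        # The second word lacks one letter of the first.
        my $n = errors('windsurfing', 'windsurfng');
        print "errors: $n\n";
</verb></tscreen>

With the number of errors allowed set to at least this count, a query for
one of the two words would also match the other.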

<sect2>Filtering query results

<p>
Harvest allows you to filter the results of a query by any query term
using any attribute defined in the <ref id="List of common SOIF
attribute names" name="List of common SOIF attribute names">.
This is done by defining <bf>filter</bf> parameters in the query form.
It is possible to define more than one filter parameter; they are
combined with Boolean <bf>AND</bf>.  A filter parameter consists of
two parts, separated by the pipe symbol ``|''.  The first part is a
query expression, which is attached to the user query using
<bf>AND</bf> before the request is sent to the broker.  The optional
second part is HTML text that is displayed on the results page
to give the user some information about the applied filter.

Example:

<tscreen><verb>
        &lt;SELECT NAME="filter"&gt;
        &lt;OPTION VALUE=''&gt;No Filter
        &lt;OPTION VALUE='uri: "xyz\.edu"|Search only xyz.edu'&gt;Search xyz.edu only
        &lt;OPTION VALUE='type: html|HTML documents only'&gt;Search HTML documents only
        &lt;/SELECT&gt;
</verb></tscreen>

The first option returns an unfiltered output.  The second option
returns only objects with ``xyz.edu'' in their URL.  The
third option returns only HTML documents.  See the advanced search
page of the broker for more examples.
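
How a filter parameter is taken apart can be sketched in Perl.  This
fragment is not taken from <tt>search.cgi</tt>; in particular, the
parentheses placed around both parts are an assumption, and the variable
names are chosen for illustration.

<tscreen><verb>
        use strict;
        use warnings;

        my $query  = 'windsurfing';
        my $filter = 'type: html|HTML documents only';

        # Split at the first pipe symbol only; the second part is optional.
        my ($expr, $text) = split /\|/, $filter, 2;

        # An empty filter leaves the user query unchanged.
        my $combined = (defined $expr and length $expr)
                     ? "($query) AND ($expr)"
                     : $query;

        print "Query sent to the broker: $combined\n";
        print "Shown on the results page: $text\n" if defined $text;
</verb></tscreen>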

<sect2>Result set presentation

<p>
The query page may have the following checkboxes, which allow some control
of the presentation of the query results.

<descrip>
<tag/Display matched lines (from content summaries):/
By selecting this checkbox, the result set presentation will contain the
lines of the Content Summary that matched the query.  Otherwise, the
matched lines will not be displayed.  The default is to display the matched
lines.

<tag/Display object descriptions (if available):/
Some objects have short, one-line descriptions associated with them.  By
selecting this checkbox, the descriptions will be presented.  Otherwise,
the object descriptions will not be displayed.  The default is to display
object descriptions.

<tag/Display links to indexed content summary:/
This checkbox allows you to set whether links to the indexed content
summaries are displayed or not.  The default is not to display links to
indexed content summaries.
</descrip>

<sect1>Customizing the Broker's Query Result Set

<p>
It is possible for the Harvest administrator to customize how the Broker
query result set is generated, by modifying a configuration file that is
interpreted by the <tt>search.cgi</tt> Perl program at query result time.

<tt>search.cgi</tt> allows you to customize almost every aspect of its
HTML output.  The file <em>$HARVEST_HOME/cgi-bin/lib/search.cf</em>
contains the default output definitions.  Individual brokers can be customized
by creating a similar file which overrides the default definitions.

<sect2>The search.cf configuration file
<label id="The search.cf configuration file">

<p>
Definitions are enclosed within SGML-like beginning and ending tags.
For example:

<tscreen><verb>
        &lt;HarvestUrl&gt;
        http://harvest.sourceforge.net/
        &lt;/HarvestUrl&gt;
</verb></tscreen>

The last newline character is removed from each definition, so that the
above becomes the string ``http://harvest.sourceforge.net/''.

Variable substitution occurs on every definition before it is output.  A
number of specific variables are defined by <tt>search.cgi</tt> which can
be used inside a definition.  For example:

<tscreen><verb>
        &lt;BrokerLoad&gt;
        Sorry, the Broker at &lt;STRONG&gt;$host, port $port&lt;/STRONG&gt;
        is currently too heavily loaded to process your request.
        Please try again later.&lt;P&gt;
        &lt;/BrokerLoad&gt;
</verb></tscreen>

When this definition is printed out, the variables <em>$host</em> and
<em>$port</em> are replaced with the hostname and port of the broker.

<sect3>Defined Variables

<p>
The following variables are defined as soon as the query string is processed.
They can be used before the broker returns any results.

<tscreen><verb>
        $maxresult    The maximum number of matched lines to be returned
        $host         The broker hostname
        $port         The broker port
        $query        The query string entered by the user
        $bquery       The whole query string sent to the broker
</verb></tscreen>

These variables are defined for each matched object returned
by the broker.

<tscreen><verb>
        $objectnum   The number of the returned object
        $desc        The description attribute of the matched object
        $opaque      ALL the matched lines from the matched object
        $url         The original URL of the matched object
        $A           The access method of $url (e.g.: http)
        $H           The hostname (including port) from $url
        $P           The path part of $url
        $D           The directory part of $P
        $F           The filename part of $P
        $cs_url      The URL of the content summary in the broker database
        $cs_a        Access part of $cs_url
        $cs_h        Hostname part of $cs_url
        $cs_p        Path part of $cs_url
        $cs_d        Directory part of $cs_p
        $cs_f        Filename part of $cs_p
</verb></tscreen>

<sect3>List of Definitions

<p>
Below is a partial list of definitions.  A complete list can be
found in the search.cf file.  Only definitions likely to be
customized are described here.

<descrip>
<tag><bf>&lt;Timeout&gt;</bf></tag>
Timeout value for <tt>search.cgi</tt>.  If the broker doesn't respond
within this time, <tt>search.cgi</tt> will exit.

<tag><bf>&lt;ResultHeader&gt;</bf></tag>
The first part of the result page.  Should probably contain the HTML
<bf>&lt;TITLE&gt;</bf> element and the user query string.

<tag><bf>&lt;ResultTrailer&gt;</bf></tag>
The last part of the result page.  The default has URL references to the
broker home page and the Harvest project home page.

<tag><bf>&lt;ResultSetBegin&gt;</bf></tag>
This is output just before looping over all the matched objects.

<tag><bf>&lt;ResultSetEnd&gt;</bf></tag>
This is output just after ending the loop over matched objects.

<tag><bf>&lt;PrintObject&gt;</bf></tag>
This definition prints out a matched object.  It should probably include
the variables <em>$url, $cs_url, $desc</em>, and <em>$opaque</em>.

<tag><bf>&lt;EndBrokerResults&gt;</bf></tag>
Printed between <bf>&lt;ResultSetEnd&gt;</bf> and
<bf>&lt;ResultTrailer&gt;</bf> if the query was successful.  Should
probably include a count of matched objects and/or matched lines.

<tag><bf>&lt;FailBrokerResults&gt;</bf></tag>
Similar to <bf>&lt;EndBrokerResults&gt;</bf> but
prints if the broker returns an error in response to the query.

<tag><bf>&lt;ObjectNumPrintf&gt;</bf></tag>
A <tt>printf</tt> format string for the object number
(<em>$objectnum</em>).

<tag><bf>&lt;TruncateWarning&gt;</bf></tag>
Prints a warning message if the result set was truncated at the maximum
number of matched lines.
</descrip>
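
For example, a customized file might override the truncation warning as
follows.  The wording is only a suggestion; <em>$maxresult</em> is one of
the variables listed above.

<tscreen><verb>
        &lt;TruncateWarning&gt;
        &lt;STRONG&gt;Only the first $maxresult matched lines are shown.&lt;/STRONG&gt;
        Please use a more specific query.&lt;P&gt;
        &lt;/TruncateWarning&gt;
</verb></tscreen>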

The following definitions are somewhat different because they are evaluated
as Perl instructions rather than strings.

<descrip>
<tag><bf>&lt;MatchedLineSub&gt;</bf></tag>
Evaluated for every matched line returned by the broker.  Can be used to
indent matched lines or to remove the leading ``Matched line'' and
attribute name strings.

<tag><bf>&lt;InitFunction&gt;</bf></tag>
Evaluated near the beginning of the <tt>search.cgi</tt> program.  Can be
used to set up special variables or read data files.

<tag><bf>&lt;PerObjectFunction&gt;</bf></tag>
Evaluated for each object just before <bf>&lt;PrintObject&gt;</bf>
is called.

<tag><bf>&lt;FormatAttribute&gt;</bf></tag>
Evaluated for each SOIF attribute requested for matched objects (see
Section <ref id="Displaying SOIF attributes in results"
name="Displaying SOIF attributes in results">).  <em>$att</em> is
set to the attribute name, and <em>$val</em> is set to the attribute
value.
</descrip>

<sect2>Example search.cf customization file

<p>
The following definitions demonstrate how to change the
<tt>search.cgi</tt> output.  The <bf>&lt;PerObjectFunction&gt;</bf>
ensures that the description is not empty.  It also prepends the
string ``matched lines:'' to any matched lines.  The
<bf>&lt;PrintObject&gt;</bf> specification prints the object number,
description, and indexing data all on the first line.  The description
is wrapped in HTML anchor tags so that it is a link to the object
originally gathered.  The words ``indexing data'' are a link to the
displaySOIF program, which will format the content summary for HTML
browsers.  The object number is formatted as a number in parentheses
such that the whole thing takes up four spaces.

The <bf>&lt;MatchedLineSub&gt;</bf> definition includes four
substitution expressions.  The first removes the words ``Matched line:''
from the beginning of each matched line.  The second removes SOIF
attributes of the form ``<em>partial-text{43}:</em>'' from the
beginning of a line.  The third displays the attribute names (e.g.
<em>partial-text#</em>) in italics.  The last
expression indents each line by five spaces to align it with the description
line.  The definition for <bf>&lt;EndBrokerResults&gt;</bf> slightly
modifies the report of how many objects were matched.

<tscreen><verb>
        # Demo to show some of the customization features for the Harvest output
        # More information can be found in the manual at:
        # http://harvest.sourceforge.net/harvest/doc/html/manual.html


        # The PerObjectFunction is Perl code evaluated for every hit
        &lt;PerObjectFunction&gt;
        # Create description
        # Is the description provided by Harvest very short (e.g. missing &lt;TITLE&gt;)?
        if (length($desc) &lt; 5) {
          # Yes: use filename ($F) instead
          $description = "&lt;I&gt;File:&lt;/I&gt; $F";
        } else {
          # No: use description provided by Harvest
          $description = $desc;
        }

        # Format matched lines ("opaque data") if data is present
        if ($opaque ne '') {
          $opaque = "&lt;strong&gt;matched lines:&lt;/strong&gt;&lt;BR&gt;$opaque"
        }
        &lt;/PerObjectFunction&gt;


        # PrintObject defines the appearance of hits
        &lt;PrintObject&gt;
        $objectnum &lt;A HREF=&quot;$url&quot;&gt;&lt;STRONG&gt;$description&lt;/STRONG&gt;&lt;/A&gt; \
        [&lt;A HREF=&quot;$cs_a://$cs_h/Harvest/cgi-bin/displaySOIF.cgi?object=$cs_p&quot;&gt;\
        indexing data&lt;/A&gt;]
        &lt;pre&gt;
             $opaque
        &lt;/pre&gt;\n
        &lt;/PrintObject&gt;


        # Format the appearance of the hit number
        &lt;ObjectNumPrintf&gt;
        (%2d)
        &lt;/ObjectNumPrintf&gt;


        # Format the appearance of every matched line
        &lt;MatchedLineSub&gt;
        s/^Matched line: *//;            # Remove "Matched line:"
        s/^([\w-]+# )[\w-]+{\d+}:\t/\1/; # Remove SOIF attributes of the form "partial-text{43}:"
        s/^([\w-]+#)/&lt;I&gt;\1&lt;\/I&gt;/;        # Format attribute names as italics
        s/^.*/     $&amp;/;                  # Add spaces to indent text
        &lt;/MatchedLineSub&gt;


        # Modifies the report of how many objects were matched
        &lt;EndBrokerResults&gt;
        &lt;STRONG&gt;Found $nopaquelines matched lines, $nobjects objects.&lt;/STRONG&gt;
        &lt;P&gt;\n
        &lt;/EndBrokerResults&gt;
</verb></tscreen>

<sect2>Integrating your customized configuration file

<p>
The <tt>search.cgi</tt> configuration files are kept in
<em>$HARVEST_HOME/cgi-bin/lib</em>.  The name of a customized file is
listed in the <em>query.html</em> form, and passed as an option to the
<tt>search.cgi</tt> program.

The simplest way to specify the customized file is by placing
an <bf>&lt;INPUT&gt;</bf> tag in the HTML form:

<tscreen><verb>
        &lt;INPUT TYPE=&quot;hidden&quot; NAME=&quot;brokerqueryconfig&quot; VALUE=&quot;custom.cf&quot;&gt;
</verb></tscreen>

Another way is to allow users to select from different customizations
with a <bf>&lt;SELECT&gt;</bf> list:

<tscreen><verb>
        &lt;SELECT NAME=&quot;brokerqueryconfig&quot;&gt;
        &lt;OPTION VALUE=&quot;&quot;&gt; Default
        &lt;OPTION VALUE=&quot;custom1.cf&quot;&gt; Customized
        &lt;OPTION VALUE=&quot;custom2.cf&quot; SELECTED&gt; Highly Customized
        &lt;/SELECT&gt;
</verb></tscreen>

<sect2>Displaying SOIF attributes in results
<label id="Displaying SOIF attributes in results">

<p>
It is possible to request SOIF attributes from the HTML query form.  A
simple approach is to include a select list in the query form:

<tscreen><verb>
        &lt;SELECT MULTIPLE NAME=&quot;attribute&quot;&gt;
        &lt;OPTION VALUE=&quot;title&quot;&gt; Title
        &lt;OPTION VALUE=&quot;author&quot;&gt; Author
        &lt;OPTION VALUE=&quot;date&quot;&gt; Date
        &lt;OPTION VALUE=&quot;subject&quot;&gt; Subject
        &lt;/SELECT&gt;
</verb></tscreen>

In this manner, the user may control which attributes get displayed.
The layout of these attributes when the results are displayed in HTML
is controlled by the <bf>&lt;FormatAttribute&gt;</bf> specification in the
<em>search.cf</em> file described in Section <ref id="The search.cf
configuration file" name="The search.cf configuration file">.
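
As an illustration of what such a form submits: a browser sends the
<tt>attribute</tt> field once for every selected option.  This is
standard HTML form behaviour rather than anything specific to Harvest.
With <em>title</em> and <em>author</em> selected, the encoded form data
would contain the following pairs (the other fields of the form are
left out here):

<tscreen><verb>
        attribute=title&amp;attribute=author
</verb></tscreen>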

<sect1>World Wide Web interface description

<p>
To allow Web browsers to easily interface with the Broker, we implemented a
World Wide Web interface to the Broker's query manager and administrative
interfaces.  This WWW interface, which includes several HTML files and a few
programs that use the <htmlurl
url="http://hoohoo.ncsa.uiuc.edu/cgi/overview.html" name="Common
Gateway Interface"> (CGI), consists of the following:

<itemize>
<item>HTML files that use <htmlurl
      url="http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/overview.html"
      name="Forms"> support to present a graphical user interface
      (GUI) to the user;
<item>CGI programs that act as a gateway between the user and the
      Broker; and
<item>Help files for the user.
</itemize>

Users go through the following steps when using a Broker to locate information:

<enum>
<item>The user issues a query to the Broker.
<item>The Broker processes the query, and returns the query results to the user.
<item>The user can then view content summaries from the result set,
      or access the URLs from the result set directly.
</enum>

To provide a WWW-queryable interface, the Broker needs to run in
conjunction with an HTTP server.  Section <ref id="Additional
installation for the Harvest Broker" name="Additional installation for
the Harvest Broker"> describes how to configure your HTTP server to
work with Harvest.

You can run the Broker on a different machine from your HTTP server,
but if you want users to be able to view the Broker's content
summaries, the Broker's files must be accessible to the HTTP server.
You can NFS-mount those files or copy them over manually.  You will
also need to change the <em>Brokers.cf</em> file to point to the host
that is running the Broker.

<sect2>HTML files for graphical user interface

<p>
<tt>CreateBroker</tt> creates some HTML files to provide GUIs to the user:

<descrip>
<tag><em>query.html</em></tag>
Contains the GUI for the query interface.  <tt>CreateBroker</tt> will
install different <em>query.html</em> files for Glimpse, Swish, and WAIS,
since each subsystem requires different defaults and supports different
functionality (e.g., unlike Glimpse, WAIS doesn't support approximate
matching).  This is also the ``home page'' for the Broker, and a link to
this page is included at the bottom of all query results.

<tag><em>admin.html</em></tag>
Contains the GUI for the administrative interface.  This file is installed
into the <em>admin</em> directory of the Broker.

<tag><em>Brokers.cf</em