The top-level directory where you installed Harvest is known as $HARVEST_HOME. By default, $HARVEST_HOME is /usr/local/harvest. The following files and directories are located in $HARVEST_HOME:
RunHarvest* brokers/ gatherers/ tmp/ bin/ cgi-bin/ lib/
RunHarvest is the script used to create and run Harvest servers (see
Starting up the system: RunHarvest and related commands).
RunHarvest has the same command-line syntax as the Harvest program described below.
The $HARVEST_HOME/bin directory contains only the programs that users would normally run directly. All other programs (e.g., the individual summarizers for the Gatherer), as well as the Perl library code, are in the lib directory. The bin directory contains the following programs:
Creates a Broker.
CreateBroker [skeleton-tree [destination]]
Main user interface to the Gatherer. This program is run by the
RunGatherer script found in a Gatherer's directory.
Gatherer [-manual|-export|-debug] file.cf
The program used by
RunHarvest to create and run Harvest servers
according to the user's description.
Harvest [flags]
where flags can be any of the following:
-novice     Simplest Q&A. Mostly uses the defaults.
-glimpse    Use Glimpse for the Broker. (default)
-swish      Use Swish for the Broker.
-wais       Use WAIS for the Broker.
-dumbtty    Dumb TTY mode.
-debug      Debug mode.
-dont-run   Don't run the Broker or the Gatherer.
-fake       Doesn't build the Harvest servers.
-protect    Don't change the umask.
The Broker program. This program is run by the
RunBroker script found in a Broker's directory. Logs messages to both
broker.out and admin/LOG.
broker [broker.conf file] [-nocol]
The client interface to the Gatherer.
gather [-info] [-nocompress] host port [timestamp]
The $HARVEST_HOME/brokers directory contains images and logos
in the images directory, some basic tutorial HTML pages, and the
skeleton files that
CreateBroker uses to construct new
Brokers. You can change the default values in these created Brokers by
editing the files in the skeleton directory.
The $HARVEST_HOME/cgi-bin directory contains the programs
needed for the WWW interface to the Broker (described in Section
CGI programs) and configuration files for
search.cgi in the lib directory.
The $HARVEST_HOME/gatherers directory contains example
Gatherers. By default,
RunHarvest will
create the new Gatherer in this directory.
The $HARVEST_HOME/lib directory contains a number of Perl library routines and other programs needed by various parts of Harvest, as follows:
Perl libraries used to communicate with remote FTP servers.
Perl libraries used to parse
Program used to retrieve files and directories from FTP servers.
ftpget [-htmlify] localfile hostname filename A,I username password
Perl program used to retrieve files and menus from Gopher servers.
gopherget.pl localfile hostname port command
Perl program to check whether Gatherers and Brokers are up.
Program used to compute MD5 checksums.
md5 file [...]
Perl program used to retrieve USENET articles and group summaries from NNTP servers.
newsget.pl localfile news-URL
Perl library used to process SOIF.
Program used to retrieve a URL.
Program to purge the local disk URL cache used by
urlget and the Gatherer.
The $HARVEST_HOME/lib/broker directory contains the search and index programs needed by the Broker, plus several utility programs needed for Broker administration, as follows:
Issues a restart command to a Broker.
BrokerRestart [-password passwd] host port
Client interface to the broker. Can be used to send queries or administrative commands to a broker.
brkclient hostname port command-string
Prints the Broker's Registry file in a human-readable format.
dumpregistry [-count] [BrokerDirectory]
agrep, glimpse, glimpseindex, glimpseindex.bin, glimpseserver
The Glimpse indexing and search system as described in Section The Broker.
The Swish indexing and search program as an alternative to Glimpse.
Perl programs used to generate Broker statistics and to create stats.html.
gather -info host port | info-to-html.pl > host.port.html
mkbrokerstats.pl broker-dir > stats.html
The $HARVEST_HOME/lib/gatherer directory contains the default summarizers described in Section Extracting data for indexing: The Essence summarizing subsystem, plus various utility programs needed by the summarizers and the Gatherer, as follows:
Default URL filter as described in Section RootNode specifications.
Essence configuration files as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.
Essence summarizers as discussed in Section Extracting data for indexing: The Essence summarizing subsystem.
Alternative HTML summarizer written in Perl.
Program to extract URLs from an HTML file.
HTMLurls [--base-url url] filename
Programs and files used by the Microsoft Word summarizer.
dvi2tty, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim
Programs used by various summarizers.
Program to support summarizers.
Program used by the TeX summarizer.
rast, sgmls, sgmlsasp, sgmls-lib
Programs and files used by the SGML summarizer.
Program used by the RTF summarizer.
Programs and files used by the WordPerfect summarizer.
hexbin, unshar, uudecode
Programs used to unnest nested objects.
Programs used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors).
cksoif < INPUT.soif
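Each SOIF attribute value carries an explicit byte count in braces, and a count that disagrees with the value's length is exactly the kind of error cksoif is meant to catch. A minimal sketch of that check for a single-line attribute (the sample line is made up; the real cksoif parses complete multi-line records):

```shell
# Sketch: verify that the {N} byte count on a single-line SOIF
# attribute matches the length of its value. The real cksoif parses
# full multi-line records; this handles one line only.
line='Title{13}: Example Title'

attr=${line%%"{"*}                        # text before '{'
count=${line#*"{"}; count=${count%%"}"*}  # digits inside '{}'
value=${line#*": "}                       # text after ': '

if [ "${#value}" -eq "$count" ]; then
  echo "ok: $attr"
else
  echo "bad byte count: $attr"
fi
```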
cleandb, consoldb, expiredb, folddb, mergedb, mkgathererstats.pl, mkindex, rmbinary
Programs used to prepare a Gatherer's database to be exported by
gatherd.
cleandb ensures that all SOIF objects are valid, and deletes
any that are not;
consoldb consolidates n GDBM database files into a single GDBM database file;
expiredb deletes any SOIF objects that are no longer
valid, as defined by their Time-to-Live attributes;
folddb runs all of the operations needed to prepare the
Gatherer's database for export by gatherd;
mergedb consolidates GDBM files as described in Section
Incorporating manually generated information into a Gatherer;
mkgathererstats.pl generates the INFO.soif file;
mkindex generates the cache of timestamps; and
rmbinary removes binary data from a GDBM database.
enum, prepurls, staturl
Programs used by the
Gatherer to perform the RootNode and
LeafNode enumeration, as described in Section
enum performs a RootNode enumeration on the given URLs;
prepurls is a wrapper program used to pipe
staturl retrieves LeafNode URLs to determine whether they have
been modified.
fileenum, ftpenum, ftpenum.pl, gopherenum-*, httpenum-*, newsenum
Programs used by
enum to perform protocol-specific
enumeration.
fileenum performs a RootNode enumeration on ``file'' URLs;
ftpenum.pl performs a RootNode
enumeration on ``ftp'' URLs;
gopherenum-breadth performs a breadth-first RootNode
enumeration on ``gopher'' URLs;
gopherenum-depth performs a depth-first RootNode enumeration
on ``gopher'' URLs;
httpenum-breadth performs a breadth-first RootNode
enumeration on ``http'' URLs;
httpenum-depth performs a depth-first RootNode enumeration on
``http'' URLs; and
newsenum performs a RootNode enumeration on ``news'' URLs.
The Essence content extraction system as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.
essence [options] -f input-URLs
essence [options] URL ...
where options are:
--dbdir directory       Directory to place database
--full-text             Use entire file instead of summarizing
--gatherer-host         Gatherer-Host value
--gatherer-name         Gatherer-Name value
--gatherer-version      Gatherer-Version value
--help                  Print usage information
--libdir directory      Directory to place configuration files
--log logfile           Name of the file to log messages to
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--allowlist filename    File with list of types to allow
--stoplist filename     File with list of types to remove
--tmpdir directory      Name of directory to use for temporary files
--type-only             Only type data; do not summarize objects
--verbose               Verbose output
--version               Version information
Reads in a SOIF stream from stdin and prints the data associated with the given attribute to stdout.
cat SOIF-file | print-attr Attribute
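For the common case of a single-line value, the effect of print-attr can be approximated with standard tools. This is only a sketch: the record below is invented, and the real print-attr honors the {N} byte counts (so it also handles multi-line values) rather than splitting on ': ':

```shell
# Approximation of print-attr for single-line values, using awk.
# The SOIF record is a made-up example; splitting on ': ' is a
# simplification of the real byte-count-driven parse.
soif='@FILE { http://example.com/
Title{13}: Example Title
Author{5}: Alice
}'

result=$(printf '%s\n' "$soif" | awk -F': ' '/^Title[{]/ { print $2 }')
echo "$result"
```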
Daemons that export the Gatherer's database.
in.gatherd is used to run this daemon from inetd.
gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port
in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]
Program to perform various operations on a GDBM database.
Usage: gdbmutil consolidate [-d | -D] master-file file [file ...]
Usage: gdbmutil delete file key
Usage: gdbmutil dump file
Usage: gdbmutil fetch file key
Usage: gdbmutil keys file
Usage: gdbmutil print [-gatherd] file
Usage: gdbmutil reorganize file
Usage: gdbmutil restore file
Usage: gdbmutil sort file
Usage: gdbmutil stats file
Usage: gdbmutil store file key < data
Program to generate valid SOIF based on a more easily editable SOIF-like format (e.g., SOIF without the byte counts).
mktemplate < INPUT.txt > OUTPUT.soif
Simple Perl program to emulate Essence's quick-sum.cf processing for those who cannot compile Essence with the corresponding C code.
Converts a stream of SOIF objects (from stdin or given files) into a GDBM database.
template2db database [tmpl tmpl...]
Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF.
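The wrapping amounts to splicing the data's byte count between an attribute name and the data itself. A minimal sketch under that assumption (the function name, attribute name, and exact whitespace are illustrative, not the real tool's output format):

```shell
# Sketch of template2soif-style wrapping: emit name{count}:<tab>data.
# Assumes the attribute name arrives as $1; the real tool's interface
# may differ. Note that command substitution strips trailing newlines,
# so the count here is for the stripped data.
wrap_soif() {
  data=$(cat)
  printf '%s{%d}:\t%s\n' "$1" "${#data}" "$data"
}

pair=$(printf 'Hello' | wrap_soif Greeting)
echo "$pair"
```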
Script to kill the gatherd process.
The $HARVEST_HOME/tmp directory is used by search.cgi to store search result pages.