
A.9 $HARVEST_HOME/lib/gatherer

The $HARVEST_HOME/lib/gatherer directory contains the default summarizers described in Section 4.5.1, plus various utility programs needed by the summarizers and the Gatherer, as follows:

Essence summarizers as discussed in Section 4.5.3.

Essence presentation unnesters or exploders as discussed in Section 4.5.4 and Appendix C.2.

Programs used by Essence presentation unnesters or exploders to convert files into SOIF streams.

  ,,, magic,,
Essence configuration files as described in Section 4.5.4.

Programs used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors).
Usage: cksoif < INPUT.soif
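
The kind of check described above can be sketched in Python. This is not the cksoif source: the function name check_soif is hypothetical, and the sketch assumes the SOIF layout of a "@TYPE { URL" header line, attribute lines of the form "Name{byte-count}:<TAB>value", and a closing "}" (see Appendix B for the actual definition).

```python
import re

# Assumed layout: "@TYPE { URL" header, "Name{N}:<TAB>value" attributes, "}" to close.
HEADER = re.compile(rb"@([A-Za-z0-9_-]+)\s*\{\s*(\S+)[ \t]*\n")
ATTR = re.compile(rb"([^\s{}:]+)\{(\d+)\}:\t")


def check_soif(stream: bytes) -> int:
    """Return the number of well-formed objects; raise ValueError on a parse error."""
    pos, count = 0, 0
    while True:
        # Skip whitespace between objects.
        while pos < len(stream) and stream[pos:pos + 1].isspace():
            pos += 1
        if pos == len(stream):
            return count
        header = HEADER.match(stream, pos)
        if not header:
            raise ValueError(f"expected template header at byte {pos}")
        pos = header.end()
        while not stream.startswith(b"}", pos):
            attr = ATTR.match(stream, pos)
            if not attr:
                raise ValueError(f"expected attribute at byte {pos}")
            # The value is exactly N bytes long and is followed by a newline.
            end = attr.end() + int(attr.group(2))
            if end > len(stream) or stream[end:end + 1] != b"\n":
                raise ValueError(f"byte count mismatch for {attr.group(1).decode()}")
            pos = end + 1
        pos += 1  # consume the closing brace
        count += 1
```

Because values are skipped by their declared byte count rather than scanned, a value may itself contain newlines or braces without confusing the check.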

    cleandb, consoldb, expiredb, folddb, mergedb, mkcompressed, mkindex, and rmbinary
Programs used to prepare a Gatherer's database to be exported by gatherd. cleandb ensures that all SOIF objects are valid, and deletes any that are not; consoldb consolidates n GDBM database files into a single GDBM database file; expiredb deletes any SOIF objects that are no longer valid as defined by their Time-To-Live attributes; folddb runs all of the operations needed to prepare the Gatherer's database for export by gatherd; mergedb consolidates GDBM files as described in Section 4.7.7; mkcompressed generates the compressed cache All-Templates.gz file; a further program generates the INFO.soif statistics file; mkindex generates the cache of timestamps; and rmbinary removes binary data from a GDBM database.
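
As a rough illustration of the expiry rule applied by expiredb, the following Python sketch removes expired objects from an in-memory mapping. It is not the expiredb source. The function name expire_objects is hypothetical, and the sketch assumes that each object carries Update-Time and Time-to-Live attributes (both in seconds) and that an object expires once Update-Time plus Time-to-Live lies in the past.

```python
import time


def expire_objects(db, now=None):
    """Delete expired entries from db in place and return the removed keys.

    db maps a URL to a dict of attribute values. The attribute names
    "Update-Time" and "Time-to-Live" are assumptions of this sketch.
    """
    now = time.time() if now is None else now
    expired = [
        url
        for url, attrs in db.items()
        if int(attrs["Update-Time"]) + int(attrs["Time-to-Live"]) < now
    ]
    for url in expired:
        del db[url]
    return expired
```

The expired keys are collected first and deleted afterwards, so the mapping is never modified while it is being iterated.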

    deroff, detex, dvi2tty, extract-perl-procs, extract-urls, get-include-files, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim, and unshar
Programs to support various summarizers.

    dbcheck, enum, fileenum, ftpenum, gopherenum, httpenum, newsenum, prepurls, and staturl
Programs used to perform the RootNode enumeration for the Gatherer as described in Section 4.3. dbcheck checks a URL to see if it has changed since the last time it was gathered; enum performs a RootNode enumeration on the given URLs; fileenum performs a RootNode enumeration on ``file'' URLs; ftpenum calls a helper program to perform a RootNode enumeration on ``ftp'' URLs; gopherenum performs a RootNode enumeration on ``gopher'' URLs; httpenum performs a RootNode enumeration on ``http'' URLs; newsenum performs a RootNode enumeration on ``news'' URLs; prepurls is a wrapper program used to pipe Gatherer and essence together; staturl retrieves LeafNode URLs so that dbcheck can determine whether the URL has been modified. All of these programs are internal to Gatherer.

The Essence content extraction system as described in Section 4.5.4.
Usage: essence [options] -f input-URLs or essence [options] URL ...

        --dbdir directory       Directory to place database
        --full-text             Use entire file instead of summarizing
        --gatherer-host         Gatherer-Host value
        --gatherer-name         Gatherer-Name value
        --gatherer-version      Gatherer-Version value
        --help                  Print usage information
        --libdir directory      Directory to place configuration files
        --log logfile           Name of the file to log messages to
        --max-deletions n       Number of GDBM deletions before reorganization
        --minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
        --no-access             Do not read contents of objects
        --no-keywords           Do not automatically generate keywords
        --allowlist filename    File with list of types to allow
        --stoplist filename     File with list of types to remove
        --tmpdir directory      Name of directory to use for temporary files
        --type-only             Only type data; do not summarize objects
        --verbose               Verbose output
        --version               Version information

    extractdb, print-attr
Prints the value of the given attribute for each SOIF object stored in the given GDBM database. print-attr uses stdin rather than GDBM-file.
Usage: extractdb GDBM-file Attribute

    gatherd, in.gatherd
Daemons that export the Gatherer's database. in.gatherd is used to run this daemon from inetd.
Usage: gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port
Usage: in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]

Program to perform various operations on a GDBM database.
Usage: gdbmutil consolidate [-d | -D] master-file file [file ...]
Usage: gdbmutil delete file key
Usage: gdbmutil dump file
Usage: gdbmutil fetch file key
Usage: gdbmutil keys file
Usage: gdbmutil print [-gatherd] file
Usage: gdbmutil reorganize file
Usage: gdbmutil restore file
Usage: gdbmutil sort file
Usage: gdbmutil stats file
Usage: gdbmutil store file key < data

    mktemplate, print-template
Program to generate valid SOIF based on a more easily editable SOIF-like format (e.g., SOIF without the byte counts). print-template can be used to ``normalize'' a SOIF stream; it reads a stream of SOIF templates from stdin, parses them, then writes a SOIF stream to stdout.
Usage: mktemplate < INPUT.txt > OUTPUT.soif

Simple Perl program to emulate Essence's processing for those who cannot compile Essence with the corresponding C code.

Converts a stream of SOIF objects (from stdin or given files) into a GDBM database.
Usage: template2db database [tmpl tmpl...]

Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF.
Usage: wrapit [Attribute]
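
The wrapping step can be illustrated with a short Python sketch. It is not the wrapit source: the function name wrap_attribute is hypothetical, and the output layout "Name{byte-count}:<TAB>value" followed by a newline is an assumption based on the SOIF format (see Appendix B).

```python
def wrap_attribute(name: str, data: bytes) -> bytes:
    """Return one SOIF attribute-value pair for the given data.

    Assumed layout: Name{byte-count}:<TAB>data<NEWLINE>, where the count
    is the length of the data in bytes, not in characters.
    """
    header = f"{name}{{{len(data)}}}:\t".encode("ascii")
    return header + data + b"\n"


# Example: wrap a short title as a summarizer might.
print(wrap_attribute("Title", b"Harvest").decode("ascii"), end="")
```

Counting bytes rather than characters matters for data that is not plain ASCII, since a reader of the stream uses the count to find where the value ends.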



Duane Wessels
Wed Jan 31 23:46:21 PST 1996