The Harvest Gatherer's actions are defined by a set of configuration and utility files, and a corresponding set of executable programs referenced by some of the configuration files.
If you want to customize a Gatherer, you should create bin and lib subdirectories in the directory where you are running the Gatherer, and then copy $HARVEST_HOME/lib/gatherer/*.cf and $HARVEST_HOME/lib/gatherer/magic into your lib directory. Then add to your Gatherer configuration file:
    Lib-Directory: lib

What each of these files does is described below. The basic contents of a typical Gatherer's directory are as follows (note: some of the file names below can be changed by setting variables in the Gatherer configuration file, as described in Section 4.7.1):
    RunGatherd*    bin/     GathName.cf    log.errors      tmp/
    RunGatherer*   data/    lib/           log.gatherer

    bin:
    MyNewType.sum*    Exploder.unnest*

    data:
    All-Templates.gz    INFO.soif          gatherd.cf
    INDEX.gdbm          PRODUCTION.gdbm    gatherd.log

    lib:
    bycontent.cf    byurl.cf    quick-sum.cf
    byname.cf       magic       stoplist.cf

    tmp:
    cache-liburl/
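The customization setup described at the start of this section (create bin and lib subdirectories, then copy the stock *.cf files and the magic file into lib) can be sketched in a few lines. This is a minimal illustration, not part of Harvest: the function name prepare_custom_gatherer and its two parameters are hypothetical, and the sketch assumes the stock files live under lib/gatherer inside the Harvest installation, as the text states.

```python
import glob
import os
import shutil


def prepare_custom_gatherer(harvest_home, gatherer_dir):
    """Create bin/ and lib/ in gatherer_dir and copy the stock files into lib/.

    harvest_home stands in for the value of $HARVEST_HOME; both parameter
    names are assumptions made for this sketch.
    Returns the base names of the files that were copied.
    """
    stock_dir = os.path.join(harvest_home, "lib", "gatherer")
    bin_dir = os.path.join(gatherer_dir, "bin")
    lib_dir = os.path.join(gatherer_dir, "lib")

    # Create the two subdirectories; do nothing if they already exist.
    os.makedirs(bin_dir, exist_ok=True)
    os.makedirs(lib_dir, exist_ok=True)

    # Collect every *.cf file, then the magic file (which has no extension).
    sources = sorted(glob.glob(os.path.join(stock_dir, "*.cf")))
    sources.append(os.path.join(stock_dir, "magic"))

    copied = []
    for src in sources:
        # Raises FileNotFoundError if an expected source file is absent.
        shutil.copy(src, lib_dir)
        copied.append(os.path.basename(src))
    return copied
```

A call might look like prepare_custom_gatherer(os.environ["HARVEST_HOME"], "."), run from the directory where the Gatherer is run. The Lib-Directory line still has to be added to the Gatherer configuration file by hand.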
RunGatherd and RunGatherer are used to export the Gatherer's database after a machine reboot and to run the Gatherer, respectively. The log.errors and log.gatherer files contain error messages and the output of the Essence typing step, respectively (Essence will be described shortly). The GathName.cf file is the Gatherer's configuration file.
The bin directory contains any summarizers and any other programs needed by the summarizers or by the presentation unnesting steps. If you customize the Gatherer by adding a summarizer or a presentation unnesting program, place those programs in this bin directory; MyNewType.sum and Exploder.unnest are examples (see Section 4.5.4).
The data directory contains the Gatherer's database, which gatherd exports. The database consists of the All-Templates.gz, INDEX.gdbm, INFO.soif, and PRODUCTION.gdbm files. The gatherd.cf file is used to support access control, as described in Section 4.7.4. The gatherd.log file is where the gatherd program logs its information.
The lib directory contains the configuration files used by the Gatherer's subsystems, namely Essence. These files are described briefly in the following table:
    bycontent.cf    Content parsing heuristics for type recognition step
    byname.cf       File naming heuristics for type recognition step
    byurl.cf        URL naming heuristics for type recognition step
    magic           UNIX "file" command specifications (matched against bycontent.cf strings)
    quick-sum.cf    Extracts attributes for summarizing step
    stoplist.cf     File types to reject during candidate selection
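Before running a customized Gatherer, it can be useful to confirm that the lib directory actually holds the files listed above. The following is a small sketch of such a check, not a Harvest tool: the function name missing_lib_files is hypothetical, and the list of expected names assumes none of them has been changed in the Gatherer configuration file.

```python
import os

# File names as listed in this section. A Gatherer configuration file may
# rename some of them, in which case this tuple would need adjusting.
EXPECTED_LIB_FILES = (
    "bycontent.cf",
    "byname.cf",
    "byurl.cf",
    "magic",
    "quick-sum.cf",
    "stoplist.cf",
)


def missing_lib_files(lib_dir):
    """Return the expected configuration files that are absent from lib_dir."""
    return [
        name
        for name in EXPECTED_LIB_FILES
        if not os.path.isfile(os.path.join(lib_dir, name))
    ]
```

An empty list from missing_lib_files("lib") means every file named above is present; it says nothing about whether the contents of those files are valid.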
We discuss each of the customizable steps in the subsections below.