As illustrated in Figure 1, Harvest consists of several subsystems. The Gatherer subsystem collects indexing information (such as keywords, author names, and titles) from the resources available at Provider sites (such as FTP and HTTP servers). The Broker subsystem retrieves indexing information from one or more Gatherers, suppresses duplicate information, incrementally indexes the collected information, and provides a WWW query interface to it. The Replicator subsystem efficiently replicates Brokers around the Internet. The Cache subsystem lets users retrieve located information efficiently. The Harvest Server Registry (HSR) is a distinguished Broker that holds information about each Harvest Gatherer, Broker, Cache, and Replicator on the Internet.
Figure 1: Harvest Software Components
The simplest way to start using Harvest is to install a single ``stock'' (i.e., not customized) Gatherer and Broker on one machine to index some of the FTP, Gopher, World Wide Web, and NetNews data at your site. You may also want to run an Object Cache (see Section 6) to reduce network traffic for accessing popular data.
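A stock Gatherer is driven by a configuration file that names the Gatherer and lists the URLs it should index. The following is only a sketch to convey the general shape of such a file; the name, port, directory, and URLs are hypothetical examples, not values from this manual, and the authoritative attribute list appears in the Gatherer documentation:

```
#  Hypothetical Gatherer configuration sketch (example values only).
#  The first attributes identify this Gatherer instance; the RootNodes
#  list names the servers whose data should be enumerated and indexed.
Gatherer-Name:  Example Site Gatherer
Gatherer-Port:  8500
Top-Directory:  /usr/local/harvest/gatherers/example

<RootNodes>
http://www.example.com/
ftp://ftp.example.com/pub/
gopher://gopher.example.com/
</RootNodes>
```

Once the Gatherer has run, a Broker on the same machine can be configured to collect the extracted indexing information from the Gatherer's host and port.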
After you get the system working in this basic configuration, you can invest additional effort as warranted. First, as you scale up to index larger volumes of information, you can reduce the CPU and network load required to index your data by distributing the gathering process. Second, you can customize how Harvest extracts, indexes, and searches your information, to better match the types of data you have and the ways your users would like to interact with the data. Finally, you can reduce the load on popular Brokers by running Replicators.
We discuss how to distribute the gathering process in the next subsection. We cover various forms of customization in Section 4.5.4 and in several parts of Section 5. We discuss Broker replication in Section 7.