The Gatherer retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and then summarizes those resources in various type-specific ways to generate structured indexing information. For example, a Gatherer can retrieve a technical report from an FTP archive, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest Brokers or other search services can then retrieve the indexing information from the Gatherer to use in a searchable index available via a WWW interface.
The Gatherer consists of a number of separate components. The Gatherer program reads a Gatherer configuration file and controls the overall process of enumerating and summarizing data objects. The structured indexing information that the Gatherer collects is represented as a list of attribute-value pairs using the Summary Object Interchange Format (SOIF, see Appendix B). The gatherd daemon serves the Gatherer database to Brokers; it remains running in the background after a gathering session completes. The stand-alone gather program is a client for the gatherd server; it is used by the Broker, and can also be run from the command line for testing. The Gatherer uses a local disk cache to store objects it has retrieved. The disk cache is described in Section 4.7.6.
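To make the attribute-value representation concrete, the sketch below renders a summary object as a SOIF-style record, in which each attribute is written with an explicit byte count for its value. This is only an illustration: the URL, template type, and attribute values here are hypothetical, and the exact layout of real SOIF records is specified in Appendix B.

```python
def soif_record(template, url, attrs):
    """Render a summary object as a SOIF-style record.

    Each attribute appears as Name{byte-count}: value; the explicit
    byte count lets values contain arbitrary data, including newlines.
    """
    lines = ["@%s { %s" % (template, url)]
    for name, value in attrs.items():
        lines.append("%s{%d}:\t%s" % (name, len(value.encode("utf-8")), value))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical technical-report summary, as the Gatherer might produce
# after extracting the author and title from a retrieved paper.
record = soif_record(
    "FILE",
    "http://example.org/tr/report.ps",  # hypothetical URL
    {"Title": "An Example Report", "Author": "J. Doe"},
)
print(record)
```

A Broker that retrieves this record over the gatherd protocol can reconstruct each attribute by reading exactly the advertised number of bytes after the colon.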
Even though the gatherd daemon remains in the background, a Gatherer does not automatically update or refresh its summary objects. Each object in a Gatherer has a time-to-live value, and objects remain in the database until they expire. See Section 4.7.5 for more information on keeping Gatherer objects up to date.
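The expiration rule above can be sketched as a simple check against each object's time-to-live. The function and field names here are hypothetical; the actual bookkeeping (and the default TTL) is described in Section 4.7.5.

```python
import time

def is_expired(gathered_at, ttl, now=None):
    """Return True once an object's time-to-live (in seconds) has
    elapsed since the time it was gathered."""
    now = time.time() if now is None else now
    return now - gathered_at >= ttl

# An object gathered 40 days ago with a 30-day TTL has expired,
# so a refresh would be needed for it to reappear in the database.
DAY = 86400
print(is_expired(gathered_at=0, ttl=30 * DAY, now=40 * DAY))  # True
```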
Several example Gatherers are provided with the Harvest software distribution (see Appendix C).