
4.2 Basic setup


To run a basic Gatherer, you need only list the Uniform Resource Locators (URLs) [2,3] from which it will gather indexing information. This list is specified in the Gatherer configuration file, along with other optional information such as the Gatherer's name and the directory in which it resides (see Section 4.7.1 for details on the optional information). Below is an example Gatherer configuration file:

        # - Sample Gatherer Configuration File
        Gatherer-Name:    My Sample Harvest Gatherer
        Top-Directory:    /usr/local/harvest/gatherers/sample

        #  Specify the URLs from which to gather.
        #  (The URLs below are examples; replace them with your own.)
        <RootNodes>
        http://www.example.com/
        </RootNodes>

        <LeafNodes>
        ftp://ftp.example.com/pub/paper.ps.Z
        </LeafNodes>

As shown in the example configuration file, you may classify a URL as a RootNode or a LeafNode. For a LeafNode URL, the Gatherer simply retrieves the URL and processes it. LeafNode URLs are typically files like PostScript papers or compressed ``tar'' distributions. For a RootNode URL, the Gatherer will expand it into zero or more LeafNode URLs by recursively enumerating it in an access-method-specific way. For FTP or Gopher, the Gatherer performs a recursive directory listing on the FTP or Gopher server to expand the RootNode (typically a directory name). For HTTP, a RootNode URL is expanded by following the embedded HTML links to other URLs. For News, the enumeration returns all the messages in the specified USENET newsgroup.
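
For illustration, RootNode URLs for each of these access methods might look like the following (the hosts and newsgroup shown are hypothetical examples):

        ftp://ftp.example.com/pub/           (recursive directory listing)
        gopher://gopher.example.com/         (recursive menu enumeration)
        http://www.example.com/index.html    (follows embedded HTML links)
        news:comp.infosystems.harvest        (all messages in the newsgroup)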

PLEASE BE CAREFUL when specifying RootNodes, as it is possible to create an enormous amount of work with a single RootNode URL. To help prevent a misconfigured Gatherer from abusing servers or running wildly, by default the Gatherer will expand a RootNode into at most 250 LeafNodes, and will only include HTML links that point to documents residing on the same server as the original RootNode URL. Several options allow you to change these limits and otherwise enhance the Gatherer specification; see Section 4.3 for details.

Note: Harvest is not intended to operate as a ``robot'', since it does not collect new URLs to retrieve other than those specified in RootNodes (of course, if you specify many high-level RootNodes you can make it operate as a robot, but that is not the intended use for the system). The Gatherer is HTTP Version 1.0 compliant, and sends the User-Agent and From request fields to HTTP servers for accountability.
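
For example, a request from the Gatherer to an HTTP server might look like the following (the header values shown are illustrative, not the exact strings the Gatherer sends):

        GET /pub/index.html HTTP/1.0
        User-Agent: Harvest-Gatherer/1.0
        From: gatherer-admin@example.com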

After you have written the Gatherer configuration file, create a directory for the Gatherer and copy the configuration file there. Then, run the Gatherer program with the configuration file as the only command-line argument, as shown below:

        % Gatherer sample.cf
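
The complete sequence of steps, assuming the configuration file is named sample.cf (a hypothetical name) and specifies the Top-Directory shown in the example above, might look like:

        % mkdir -p /usr/local/harvest/gatherers/sample
        % cp sample.cf /usr/local/harvest/gatherers/sample
        % cd /usr/local/harvest/gatherers/sample
        % Gatherer sample.cf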

The Gatherer will generate a database of the content summaries, a log file (log.gatherer), and an error log file (log.errors). It will also export the indexing information automatically to Brokers and other clients. To view the exported indexing information, you can use the gather client program, as shown below (see Appendix A for usage information):


        % gather localhost 8500 | more

The -info option causes the Gatherer to respond only with the Gatherer summary information, which consists of the attributes available in the specified Gatherer's database, the Gatherer's host and name, the range of object update times, and the number of objects. Compression is the default, but can be disabled with the -nocompress option. The optional timestamp tells the Gatherer to send only the objects that have changed since the specified timestamp (in seconds since the UNIX ``epoch'' of January 1, 1970).
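
For example, assuming the timestamp is given after the host and port (see Appendix A for the exact usage), the following commands would retrieve the summary information, and then only the objects that have changed since midnight UTC on January 1, 1996 (epoch time 820454400):

        % gather -info localhost 8500
        % gather localhost 8500 820454400 | more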


Duane Wessels
Wed Jan 31 23:46:21 PST 1996