
C.2 Example 2 - Incorporating manually generated information

 

The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually-generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Appendix B).

This example demonstrates an exploder for the Linux Software Map (LSM) format. LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system. A demo of our LSM Gatherer and Broker is available.

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-2
        % ./RunGatherer

  To view the configuration file for this Gatherer, look at example-2.cf. Notice that the Gatherer has its own Lib-Directory (see Section 4.7.1 for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence. In this example, we've only customized the candidate selection step. lib/stoplist.cf defines the types that Essence should not index. This example uses an empty stoplist.cf file to direct Essence to index all files.

  The Gatherer retrieves each of the LeafNode URLs, which are all Linux Software Map files from the Linux FTP archive tsx-11.mit.edu. The Gatherer recognizes that a ``.lsm'' file is LSM type because of the naming heuristic present in lib/byname.cf. The LSM type is a ``nested'' type as specified in the Essence source code. Exploder programs (named TypeName.unnest) are run on nested types rather than the usual summarizers. The LSM.unnest program is the standard exploder program that takes an LSM file and generates one or more corresponding SOIF objects. When the Gatherer finishes, it contains one or more corresponding SOIF objects for the software described within each LSM file.
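
To illustrate the output side of an exploder, the following is a minimal sketch in Python of how a program might emit one SOIF object from a list of attribute/value pairs. It is not the LSM.unnest program shipped with Harvest; the function name soif_object and the sample URL are hypothetical, and the sketch assumes that the number in braces is the byte length of the value and that a single tab separates the colon from the value.

```python
def soif_object(template, url, attributes):
    """Render one SOIF object as bytes (illustrative sketch only).

    template   -- template type, e.g. "FILE"
    url        -- URL that the object summarizes
    attributes -- iterable of (name, value) string pairs
    """
    parts = [f"@{template} {{ {url}\n".encode("utf-8")]
    for name, value in attributes:
        data = value.encode("utf-8")
        # Assumption: the braces hold the size of the value in bytes,
        # and a tab follows the colon.
        parts.append(f"{name}{{{len(data)}}}:\t".encode("utf-8") + data + b"\n")
    parts.append(b"}\n")
    return b"".join(parts)


if __name__ == "__main__":
    # Hypothetical input; a real exploder would read these from the LSM file.
    obj = soif_object(
        "FILE",
        "ftp://ftp.example.org/pub/demo-1.0.tar.gz",
        [("Title", "Demo package"), ("Version", "1.0")],
    )
    print(obj.decode("utf-8"), end="")
```

An exploder in this style would call such a function once per resource described in the input file and write the results to its output stream, so that one input file yields a stream of content summaries.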

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:

        % gather localhost 9222 | more

Because tsx-11.mit.edu is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in log.errors and log.gatherer to try to determine the problem.

The following two SOIF objects were generated by this Gatherer. The first object summarizes the LSM file itself, and the second object summarizes the software described in the LSM file.

        @FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
        Time-to-Live{7}:        9676800
        Last-Modification-Time{9}:      781931042
        Refresh-Rate{7}:        2419200
        Gatherer-Name{25}:      Example Gatherer Number 2
        Gatherer-Host{22}:      powell.cs.colorado.edu
        Gatherer-Version{3}:    0.4
        Type{3}:        LSM
        Update-Time{9}: 781931042
        File-Size{3}:   848
        MD5{32}:        67377f3ea214ab680892c82906081caf
        }

        @FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
        Time-to-Live{7}:        9676800
        Last-Modification-Time{9}:      781931042
        Refresh-Rate{7}:        2419200
        Gatherer-Name{25}:      Example Gatherer Number 2
        Gatherer-Host{22}:      powell.cs.colorado.edu
        Gatherer-Version{3}:    0.4
        Update-Time{9}: 781931042
        Type{16}:       GNUCompressedTar
        Title{48}:      Section 2, 3, 4, 5, 7, and 9 man pages for Linux
        Version{3}:     1.4
        Description{124}:       Man pages for Linux.  Mostly section 2 is complete.  Section
        3 has over 200 man pages, but it still far from being finished.
        Author{27}:     Linux Documentation Project
        AuthorEmail{11}:        DOC channel
        Maintainer{9}:  Rik Faith
        MaintEmail{16}: faith@cs.unc.edu
        Site{45}:       ftp.cs.unc.edu
        sunsite.unc.edu
        tsx-11.mit.edu
        Path{94}:       /pub/faith/linux
        /pub/Linux/docs/linux-doc-project/man-pages
        /pub/linux/docs/linux-doc-project
        File{20}:       man-pages-1.4.tar.gz
        FileSize{4}:    170k
        CopyPolicy{47}: Public Domain or otherwise freely distributable
        Keywords{10}:   man
        pages

        Entered{24}:    Sun Sep 11 19:52:06 1994
        EnteredBy{9}:   Rik Faith
        CheckedEmail{16}:       faith@cs.unc.edu
        }
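
Note that the number in braces after each attribute name counts the bytes in the value, including any embedded newlines (for example, the Keywords value above is 10 bytes: ``man'', a newline, ``pages'', and a final newline). A reader of SOIF therefore has to consume values by size rather than by line. The following Python sketch shows one way this might be done. It is not part of Harvest; the function name parse_soif is hypothetical, and the sketch assumes that a single tab or space follows the colon and that continuation lines carry no extra indentation in the raw stream.

```python
import re

# "@TYPE { URL" header line of an object.
HEADER = re.compile(rb"@(\w+)\s*\{\s*(\S+)[^\n]*\n")
# "Name{size}:" followed by one separator character (assumed tab or space).
ATTR = re.compile(rb"\s*([^\s{}:]+)\{(\d+)\}:[ \t]")


def parse_soif(data):
    """Parse a byte string of SOIF objects into (type, url, attrs) tuples."""
    objects = []
    pos = 0
    while True:
        header = HEADER.search(data, pos)
        if header is None:
            break
        template = header.group(1).decode("utf-8")
        url = header.group(2).decode("utf-8")
        pos = header.end()
        attrs = {}
        while True:
            attr = ATTR.match(data, pos)
            if attr is None:
                break
            size = int(attr.group(2))
            start = attr.end()
            # Take exactly `size` bytes, so newlines inside a value are kept.
            attrs[attr.group(1).decode("utf-8")] = data[start:start + size].decode("utf-8")
            pos = start + size
        # Skip past the brace that closes this object.
        close = data.find(b"}", pos)
        pos = len(data) if close == -1 else close + 1
        objects.append((template, url, attrs))
    return objects
```

Given the output of the gather command saved to a file, calling parse_soif on the file's bytes would return one tuple per content summary, with multi-line attributes such as Site and Path preserved as single values.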

  We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP. We have a demo available via the Web.

 





Duane Wessels
Wed Jan 31 23:46:21 PST 1996