
4. The Gatherer

4.1 Overview

The Gatherer retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and then summarizes those resources in various type-specific ways to generate structured indexing information. For example, a Gatherer can retrieve a technical report from an FTP archive, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest Brokers or other search services can then retrieve the indexing information from the Gatherer to use in a searchable index available via a WWW interface.

The Gatherer consists of a number of separate components. The Gatherer program reads a Gatherer configuration file and controls the overall process of enumerating and summarizing data objects.

The structured indexing information that the Gatherer collects is represented as a list of attribute-value pairs using the Summary Object Interchange Format (SOIF, see Section The Summary Object Interchange Format (SOIF)). The gatherd daemon serves the Gatherer database to Brokers; it remains running in the background after a gathering session completes. The stand-alone gather program is a client for the gatherd server. It can be used from the command line for testing, and is used by the Broker. The Gatherer uses a local disk cache to store objects it has retrieved. The disk cache is described in Section The local disk cache.

Even though the gatherd daemon remains in the background, a Gatherer does not automatically update or refresh its summary objects. Each object in a Gatherer has a Time-to-Live value. Objects remain in the database until they expire. See Section Periodic gathering and realtime updates for more information on keeping Gatherer objects up to date.

Several example Gatherers are provided with the Harvest software distribution (see Section Gatherer Examples).

4.2 Basic setup

To run a basic Gatherer, you need only list the Uniform Resource Locators (URLs, see RFC1630 and RFC1738) from which it will gather indexing information. This list is specified in the Gatherer configuration file, along with other optional information such as the Gatherer's name and the directory in which it resides (see Section Setting variables in the Gatherer configuration file for details on the optional information). Below is an example Gatherer configuration file:

        #
        #  sample.cf - Sample Gatherer Configuration File
        #
        Gatherer-Name:    My Sample Harvest Gatherer
        Gatherer-Port:    8500
        Top-Directory:    /usr/local/harvest/gatherers/sample

        <RootNodes>
        # Enter URLs for RootNodes here
        http://www.mozilla.org/
        http://www.xfree86.org/
        </RootNodes>

        <LeafNodes>
        # Enter URLs for LeafNodes here
        http://www.arco.de/~kj/index.html
        </LeafNodes>

As shown in the example configuration file, you may classify an URL as a RootNode or a LeafNode. For a LeafNode URL, the Gatherer simply retrieves the URL and processes it. LeafNode URLs are typically files like PostScript papers or compressed ``tar'' distributions. For a RootNode URL, the Gatherer will expand it into zero or more LeafNode URLs by recursively enumerating it in an access method-specific way. For FTP or Gopher, the Gatherer will perform a recursive directory listing on the FTP or Gopher server to expand the RootNode (typically a directory name). For HTTP, a RootNode URL is expanded by following the embedded HTML links to other URLs. For News, the enumeration returns all the messages in the specified USENET newsgroup.

PLEASE BE CAREFUL when specifying RootNodes as it is possible to specify an enormous amount of work with a single RootNode URL. To help prevent a misconfigured Gatherer from abusing servers or running wildly, by default the Gatherer will only expand a RootNode into 250 LeafNodes, and will only include HTML links that point to documents that reside on the same server as the original RootNode URL. There are several options that allow you to change these limits and otherwise enhance the Gatherer specification. See Section RootNode specifications for details.

The Gatherer is a ``robot'' that collects URLs starting from the URLs specified in RootNodes. It obeys the robots.txt convention and the robots META tag. It is also HTTP Version 1.1 compliant and sends the User-Agent and From request fields to HTTP servers for accountability.

After you have written the Gatherer configuration file, create a directory for the Gatherer and copy the configuration file there. Then, run the Gatherer program with the configuration file as the only command-line argument, as shown below:

        % Gatherer GathName.cf

The Gatherer will generate a database of the content summaries, a log file (log.gatherer), and an error log file (log.errors). It will also start the gatherd daemon which exports the indexing information automatically to Brokers and other clients. To view the exported indexing information, you can use the gather client program, as shown below:

        % gather localhost 8500 | more

The gather program supports a few options. The -info option causes the Gatherer to respond only with the Gatherer summary information, which consists of the attributes available in the specified Gatherer's database, the Gatherer's host and name, the range of object update times, and the number of objects. Compression is the default, but can be disabled with the -nocompress option. An optional timestamp argument tells the Gatherer to send only the objects that have changed since the specified time (in seconds since the UNIX ``epoch'' of January 1, 1970).
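
For example, assuming the options precede the host, port, and optional timestamp arguments, you could request just the summary information, or an uncompressed stream of only those objects changed since an (arbitrary, illustrative) timestamp:

        % gather -info localhost 8500
        % gather -nocompress localhost 8500 840000000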

Gathering News URLs with NNTP

News URLs are somewhat different than the other access protocols because the URL generally does not contain a hostname. The Gatherer retrieves News URLs from an NNTP server. The name of this server must be placed in the environment variable $NNTPSERVER. It is probably a good idea to add this to your RunGatherer script. If the environment variable is not set, the Gatherer attempts to connect to a host named news at your site.
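
For example, adding a line like the following to RunGatherer (using the same sh syntax as the RunGatherer example in Section Gatherer administration) points the Gatherer at a specific news server; the hostname here is only an illustration:

        NNTPSERVER=news.your.domain; export NNTPSERVER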

Cleaning out a Gatherer

Remember that the Gatherer's databases persist between runs. Objects remain in the databases until they expire. When experimenting with the Gatherer, it is always a good idea to ``clean out'' the databases between runs. This is most easily accomplished by executing this command from the Gatherer directory:

        % rm -rf data tmp log.*

4.3 RootNode specifications

The RootNode specification facility described in Section Basic setup provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits, for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. It is possible to specify these and other aspects of enumeration, using the following syntax:

        <RootNodes>
        URL EnumSpec
        URL EnumSpec
        ...
        </RootNodes>

where EnumSpec is on a single line (using ``\'' to escape linefeeds), with the following syntax:

        URL=URL-Max[,URL-Filter-filename]  \
        Host=Host-Max[,Host-Filter-filename] \
        Access=TypeList \
        Delay=Seconds \
        Depth=Number \
        Enumeration=Enumeration-Program

The EnumSpec modifiers are all optional, and have the following meanings:

URL-Max

The number specified on the right-hand side of the ``URL='' expression is the maximum number of LeafNode URLs to generate, at all levels of depth, starting from the current URL. Note that URL-Max limits the number of URLs generated during the enumeration; it is not a limit on how many URLs can pass through the candidate selection phase (see Section Customizing the candidate selection step).

URL-Filter-filename

This is the name of a file containing a set of regular expression filters (see Section RootNode filters) to allow or deny particular LeafNodes in the enumeration. The default filter is $HARVEST_HOME/lib/gatherer/URL-filter-default which excludes many image and sound files.

Host-Max

The number specified on the right-hand side of the ``Host='' expression is the maximum number of hosts that will be contacted during the RootNode enumeration. The enumeration counts hosts by IP address so that aliased hosts are enumerated properly. Note that this does not work correctly for multi-homed hosts, or for hosts with rotating DNS entries (used by some sites for load balancing heavily accessed servers).

Note: Prior to Harvest Version 1.2 the ``Host=...'' line was called ``Site=...''. We changed the name to ``Host='' because it is more intuitively meaningful (being a host count limit, not a site count limit). For backwards compatibility with older Gatherer configuration files, we will continue to treat ``Site='' as an alias for ``Host=''.

Host-Filter-filename

This is the name of a file containing a set of regular expression filters to allow or deny particular hosts in the enumeration. Each expression can specify both a host name (or IP address) and a port number (in case you have multiple servers running on different ports of the same host and want to index only one). The syntax is ``hostname:port''.

Access

If the RootNode is an HTTP URL, then you can specify which access methods across which to enumerate. Valid access method types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS. Use a ``|'' character between type names to allow multiple access methods. For example, ``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode URL.

Note: We do not support cross-method enumeration from Gopher, because of the difficulty of ensuring that Gopher pointers do not cross site boundaries. For example, the Gopher URL gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/ would get an FTP directory listing of ftp.cs.washington.edu:/pub, even though the host part of the URL is powell.cs.colorado.edu.

Delay

This is the number of seconds to wait between server contacts. It defaults to one second. For example, Delay=3 makes the Gatherer sleep 3 seconds between server contacts.

Depth

This is the maximum number of levels of enumeration that will be followed during gathering. Depth=0 means that there is no limit to the depth of the enumeration. Depth=1 means the specified URL will be retrieved, and all the URLs referenced by the specified URL will be retrieved; and so on for higher Depth values. In other words, the enumeration will follow links up to Depth steps away from the specified URL.

Enumeration-Program

This modifier adds a very flexible way to control a Gatherer. The Enumeration-Program is a filter which reads URLs as input and writes new enumeration parameters on output. See section Generic Enumeration program description for specific details.

By default, URL-Max is 250, URL-Filter applies no restrictions, Host-Max is 1, Host-Filter applies no restrictions, Access is HTTP only, Delay is 1 second, and Depth is zero. There is no way to specify an unlimited value for URL-Max or Host-Max.

RootNode filters

Filter files use the standard UNIX regular expression syntax (as defined by the POSIX standard), not the csh ``globbing'' syntax. For example, you would use ``.*abc'' to indicate any string ending with ``abc'', not ``*abc''. A filter file has the following syntax:

        Deny  regex
        Allow regex

The URL-Filter regular expressions are matched only on the URL-path portion of each URL (the scheme, hostname and port are excluded). For example, the following URL-Filter file would allow all URLs except those containing the regular expression ``/gatherers/'':

        Deny  /gatherers/
        Allow .

Another common use of URL-filters is to prevent the Gatherer from travelling ``up'' a directory. Automatically generated HTML pages for HTTP and FTP directories often contain a link for the parent directory ``..''. To keep the gatherer below a specific directory, use a URL-filter file such as:

        Allow ^/my/cool/stuff/
        Deny  .

The Host-Filter regular expressions are matched on the ``hostname:port'' portion of each URL. Because the port is included, you cannot use ``$'' to anchor the end of a hostname. Beginning with version 1.3, IP addresses may be specified in place of hostnames. A class B address such as 128.138.0.0 would be written as ``^128\.138\..*'' in regular expression syntax. For example:

        Deny   bcn.boulder.co.us:8080
        Deny   bvsd.k12.co.us
        Allow  ^128\.138\..*
        Deny   .

The order of the Allow and Deny entries is important, since the filters are applied sequentially from first to last. So, for example, if you list ``Allow .*'' first, no subsequent Deny expressions will be used, since this Allow filter will allow all entries.

Generic Enumeration program description

Flexible enumeration can be achieved by giving an Enumeration=Enumeration-Program modifier to a RootNode URL. The Enumeration-Program is a filter which takes URLs on standard input and writes new RootNode URLs on standard output.

The output format is different from that of a RootNode URL specification in the Gatherer configuration file. Each output line must have nine fields separated by spaces. These fields are:

        URL
        URL-Max
        URL-Filter-filename
        Host-Max
        Host-Filter-filename
        Access
        Delay
        Depth
        Enumeration-Program

These are the same fields as described in section RootNode specifications. Values must be given for each field. Use /dev/null to disable the URL-Filter-filename and Host-Filter-filename. Use /bin/false to disable the Enumeration-Program.
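
As an illustration (this script is not part of the distribution), an Enumeration-Program that simply re-emits each input URL with default-like limits, both filters disabled, and no further enumeration program could look like this:

        #!/bin/sh
        # Read URLs on stdin; write one nine-field RootNode line per URL.
        # Fields: URL URL-Max URL-Filter Host-Max Host-Filter Access
        #         Delay Depth Enumeration-Program
        while read url; do
            echo "$url 250 /dev/null 1 /dev/null HTTP 1 0 /bin/false"
        done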

Example RootNode configuration

Below is an example RootNode configuration:

        <RootNodes>
  (1)   http://harvest.cs.colorado.edu/               URL=100,MyFilter
  (2)   http://www.cs.colorado.edu/                   Host=50 Delay=60
  (3)   gopher://gopher.colorado.edu/                 Depth=1
  (4)   file://powell.cs.colorado.edu/home/hardy/     Depth=2
  (5)   ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
  (6)   http://harvest.cs.colorado.edu/~hardy/hotlist.html \
                Depth=1 Delay=60
  (7)   http://harvest.cs.colorado.edu/~hardy/ \
                Depth=2 Access=HTTP|FTP
        </RootNodes>

Each of the above RootNodes follows a different enumeration configuration as follows:

  1. This RootNode will gather up to 100 documents that pass through the URL name filters contained within the file MyFilter.
  2. This RootNode will gather the documents from up to the first 50 hosts it encounters while enumerating the specified URL, with no limit on the Depth of link enumeration. It will also wait for 60 seconds between each retrieval.
  3. This RootNode will gather only the documents from the top-level menu of the Gopher server at gopher.colorado.edu.
  4. This RootNode will gather all documents that are in the /home/hardy directory, or that are in any subdirectory of /home/hardy.
  5. This RootNode will gather only the documents that are in the /pub/cs/techreports directory which, in this case, contains bibliographic files rather than the technical reports themselves.
  6. This RootNode will gather all documents that are within 1 step of the specified RootNode URL, waiting 60 seconds between each retrieval. This is a good way to index your hotlist: by using an HTML file containing your ``hotlist'' pointers as the RootNode, this enumeration will gather the top-level page of each of your hotlist pointers.
  7. This RootNode will gather all documents that are at most 2 steps away from the specified RootNode URL. Furthermore, it will follow and enumerate any HTTP or FTP URLs that it encounters during enumeration.

Gatherer enumeration vs. candidate selection

In addition to using the URL-Filter and Host-Filter files for the RootNode specification mechanism described in Section RootNode specifications, you can prevent documents from being indexed through customizing the stoplist.cf file, described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps. Since these mechanisms are invoked at different times, they have different effects. The URL-Filter and Host-Filter mechanisms are invoked by the Gatherer's ``RootNode'' enumeration programs. Using these filters as stop lists can prevent unwanted objects from being retrieved across the network. This can dramatically reduce gathering time and network traffic.

The stoplist.cf file is used by the Essence content extraction system (described in Section Extracting data for indexing: The Essence summarizing subsystem) after the objects are retrieved, to select which objects should have their contents extracted and indexed. This can be useful because Essence provides a more powerful means of rejecting indexing candidates, in which you can customize based not only on file naming conventions but also on file contents (e.g., looking at strings at the beginning of a file or at UNIX ``magic'' numbers), and also on more sophisticated file-grouping schemes (e.g., deciding not to extract contents from object code files for which source code is available).

As an example of combining these mechanisms, suppose you want to index the ``.ps'' files linked into your WWW site. You could do this by having a stoplist.cf file that contains ``HTML'', and a RootNode URL-Filter that contains:

        Allow \.html
        Allow \.ps
        Deny  .*
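
The matching stoplist.cf would then simply name the HTML type, so that the HTML pages are still retrieved and enumerated but not indexed. Assuming the stock one-type-per-line format, it would contain:

        HTML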

As a final note, independent of these customizations the Gatherer attempts to avoid retrieving objects where possible, by using a local disk cache of objects, and by using the HTTP ``If-Modified-Since'' request header. The local disk cache is described in Section The local disk cache.

4.4 Generating LeafNode/RootNode URLs from a program

It is possible to generate RootNode or LeafNode URLs automatically from program output. This might be useful when gathering a large number of Usenet newsgroups, for example. The program is specified inside the RootNode or LeafNode section, preceded by a pipe symbol.

        <LeafNodes>
        |generate-news-urls.sh
        </LeafNodes>

The script must output valid URLs, such as

        news:comp.unix.voodoo
        news:rec.pets.birds
        http://www.nlanr.net/
        ...
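
Such a generator can be a very small script. The sketch below (not part of the distribution) turns a list of newsgroup names, one per line in a hypothetical file named newsgroups.list, into News URLs:

        #!/bin/sh
        # generate-news-urls.sh (sketch): prefix each newsgroup name
        # with the news: scheme to form a valid News URL.
        sed 's/^/news:/' newsgroups.list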

In the case of RootNode URLs, enumeration parameters can be given after the program.

        <RootNodes>
        |my-fave-sites.pl Depth=1 URL=5000,url-filter
        </RootNodes>

4.5 Extracting data for indexing: The Essence summarizing subsystem

After the Gatherer retrieves a document, it passes the document through a subsystem called Essence to extract indexing information. Essence allows the Gatherer to collect indexing information easily from a wide variety of information, using different techniques depending on the type of data and the needs of the particular corpus being indexed. In a nutshell, Essence can determine the type of data pointed to by a URL (e.g., PostScript vs. HTML), ``unravel'' presentation nesting formats (such as compressed ``tar'' files), select which types of data to index (e.g., don't index Audio files), and then apply a type-specific extraction algorithm (called a summarizer) to the data to generate a content summary. Users can customize each of these aspects, but often this is not necessary. Harvest is distributed with a ``stock'' set of type recognizers, presentation unnesters, candidate selectors, and summarizers that work well for many applications.

Below we describe the stock summarizer set, the current components distribution, and how users can customize summarizers to change how they operate and add summarizers for new types of data. If you develop a summarizer that is likely to be useful to other users, please notify us via email at lee@arco.de so we may include it in our Harvest distribution.

Type            Summarizer Function
--------------------------------------------------------------------
Bibliographic   Extract author and titles
Binary          Extract meaningful strings and manual page summary
C, CHeader      Extract procedure names, included file names, and comments
Dvi             Invoke the Text summarizer on extracted ASCII text
FAQ, FullText, README
                Extract all words in file
Font            Extract comments
HTML            Extract anchors, hypertext links, and selected fields
LaTex           Parse selected LaTex fields (author, title, etc.)
Mail            Extract certain header fields
Makefile        Extract comments and target names
ManPage         Extract synopsis, author, title, etc., based on ``-man'' macros
News            Extract certain header fields
Object          Extract symbol table
Patch           Extract patched file names
Perl            Extract procedure names and comments
PostScript      Extract text in word processor-specific fashion, and pass
                through Text summarizer.
RCS, SCCS       Extract revision control summary
RTF             Up-convert to HTML and pass through HTML summarizer
SGML            Extract fields named in extraction table
ShellScript     Extract comments
SourceDistribution
                Extract full text of README file and comments from Makefile
                and source code files, and summarize any manual pages
SymbolicLink    Extract file name, owner, and date created
TeX             Invoke the Text summarizer on extracted ASCII text
Text            Extract first 100 lines plus first sentence of each
                remaining paragraph
Troff           Extract author, title, etc., based on ``-man'', ``-ms'',
                ``-me'' macro packages, or extract section headers and
                topic sentences.
Unrecognized    Extract file name, owner, and date created.

Default actions of ``stock'' summarizers

The table in Section Extracting data for indexing: The Essence summarizing subsystem provides a brief reference for how documents are summarized depending on their type. These actions can be customized, as discussed in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps. Some summarizers are implemented as UNIX programs while others are expressed as regular expressions; see Section Customizing the summarizing step or Section Example 4 for more information about how to write a summarizer.

Summarizing SGML data

It is possible to summarize documents that conform to the Standard Generalized Markup Language (SGML), for which you have a Document Type Definition (DTD). The World Wide Web's Hypertext Mark-up Language (HTML) is actually a particular application of SGML, with a corresponding DTD. (In fact, the Harvest HTML summarizer can use the HTML DTD and our SGML summarizing mechanism, which provides various advantages; see Section The SGML-based HTML summarizer.) SGML is being used in an increasingly broad variety of applications, for example as a format for storing data for a number of physical sciences. Because SGML allows documents to contain a good deal of structure, Harvest can summarize SGML documents very effectively.

The SGML summarizer (SGML.sum) uses the sgmls program by James Clark to parse the SGML document. The parser needs both a DTD for the document and a Declaration file that describes the allowed character set. The SGML.sum program uses a table that maps SGML tags to SOIF attributes.

Location of support files

SGML support files can be found in $HARVEST_HOME/lib/gatherer/sgmls-lib/. For example, these are the default pathnames for HTML summarizing using the SGML summarizing mechanism:

        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl

The location of the DTD file must be specified in the sgmls catalog ($HARVEST_HOME/lib/gatherer/sgmls-lib/catalog). For example:

        DOCTYPE   HTML   HTML/html.dtd

The SGML.sum program looks for the .decl file in the default location. An alternate pathname can be specified with the -d option to SGML.sum.

The summarizer looks for the .sum.tbl file first in the Gatherer's lib directory and then in the default location. Both of these can be overridden with the -t option to SGML.sum.

The SGML to SOIF table

The translation table provides a simple yet powerful way to specify how an SGML document is to be summarized. There are four ways to map SGML data into SOIF. The first two are concerned with placing the content of an SGML tag into a SOIF attribute.

A simple SGML-to-SOIF mapping looks like this:

        <TAG>              soif1,soif2,...

This places the content that occurs inside the tag ``TAG'' into the SOIF attributes ``soif1'' and ``soif2''. It is possible to select different SOIF attributes based on SGML attribute values. For example, if ``ATT'' is an attribute of ``TAG'', then it would be written like this:

        <TAG,ATT=x>         x-stuff
        <TAG,ATT=y>         y-stuff
        <TAG>               stuff

The other two mappings place the values of SGML attributes into SOIF attributes. To place the value of the ``ATT'' attribute of the ``TAG'' tag into the ``att-stuff'' SOIF attribute, you would write:

        <TAG:ATT>           att-stuff

It is also possible to place the value of an SGML attribute into a SOIF attribute named by a different SOIF attribute:

        <TAG:ATT1>          $ATT2

When the summarizer encounters an SGML tag not listed in the table, the content is passed to the parent tag and becomes a part of the parent's content. To force the content of some tag not to be passed up, specify the SOIF attribute as ``ignore''. To force the content of some tag to be passed to the parent in addition to being placed into a SOIF attribute, list an additional SOIF attribute named ``parent''.

Please see Section The SGML-based HTML summarizer for examples of these mappings.

Errors and warnings from the SGML Parser

The sgmls parser can generate an overwhelming volume of error and warning messages. This will be especially true for HTML documents found on the Internet, which often do not conform to the strict HTML DTD. By default, errors and warnings are redirected to /dev/null so that they do not clutter the Gatherer's log files. To enable logging of these messages, edit the SGML.sum Perl script and set $syntax_check = 1.

Creating a summarizer for a new SGML-tagged data type

To create an SGML summarizer for a new SGML-tagged data type with an associated DTD, you need to do the following:

  1. Write a shell script named FOO.sum which simply contains
            #!/bin/sh
            exec SGML.sum FOO $*
          
    
  2. Modify the essence configuration files (as described in Section Customizing the type recognition step) so that your documents get typed as FOO.
  3. Create the directory $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/ and copy your DTD and Declaration there as FOO.dtd and FOO.decl. Edit $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog and add FOO.dtd to it.
  4. Create the translation table FOO.sum.tbl and place it with the DTD in $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/; a minimal sketch of such a table follows this list.
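
For example, a minimal FOO.sum.tbl for a hypothetical DTD whose documents use TITLE, AUTHOR, and ABSTRACT elements might contain:

        <TITLE>             title
        <AUTHOR>            author
        <ABSTRACT>          abstract,full-text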

At this point you can test everything from the command line as follows:

        % FOO.sum myfile.foo

The SGML-based HTML summarizer

Harvest can summarize HTML using the generic SGML summarizer described in Section Summarizing SGML data. The advantage of this approach is that the summarizer is more easily customizable, and fits with the well-conceived SGML model (where you define DTDs for individual document types and build interpretation software to understand DTDs rather than individual document types). The downside is that the summarizer is now pickier about syntax, and many Web documents are not syntactically correct. Because of this pickiness, the default is for the HTML summarizer to run with syntax checking outputs disabled. If your documents are so badly formed that they confuse the parser, this may mean the summarizing process dies unceremoniously. If you find that some of your HTML documents do not get summarized or only get summarized in part, you can turn syntax-checking output on by setting $syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will allow you to see which documents are invalid and where.

Note that part of the reason for this problem is that Web browsers do not insist on well-formed documents. So, users can easily create documents that are not completely valid, yet display fine.

Below is the default SGML-to-SOIF table used by the HTML summarizer:

HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <A>             keywords,parent
    <A:HREF>        url-references
    <ADDRESS>       address
    <B>             keywords,parent
    <BODY>          body
    <CITE>          references
    <CODE>          ignore
    <EM>            keywords,parent
    <H1>            headings
    <H2>            headings
    <H3>            headings
    <H4>            headings
    <H5>            headings
    <H6>            headings
    <HEAD>          head
    <I>             keywords,parent
    <IMG:SRC>       images
    <META:CONTENT>  $NAME
    <STRONG>        keywords,parent
    <TITLE>         title
    <TT>            keywords,parent
    <UL>            keywords,parent

The pathname to this file is $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl.

Individual Gatherers may do customized HTML summarizing by placing a modified version of this file in the Gatherer lib directory. Another way to customize is to modify the HTML.sum script and add a -t option to the SGML.sum command. For example:

        SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML $*

In HTML, the document title is written as:

        <TITLE>My Home Page</TITLE>

The above translation table will place this in the SOIF summary as:

        title{13}:  My Home Page

Note that ``keywords,parent'' occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words will be copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable by not removing certain words.

Any text that appears inside a pair of CODE tags will not show up in the summary because we specified ``ignore'' as the SOIF attribute.

URLs in HTML anchors are written as:

        <A HREF="http://harvest.cs.colorado.edu/">

The specification for <A:HREF> in the above translation table causes this to appear as:

        url-references{32}: http://harvest.cs.colorado.edu/

Adding META data to your HTML

One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical use of the META element is:

        <META NAME="author" CONTENT="Joe T. Slacker">

By specifying ``<META:CONTENT> $NAME'' in the translation table, this comes out as:

        author{15}: Joe T. Slacker

Using the META tags, HTML authors can easily add a list of keywords to their documents:

        <META NAME="keywords" CONTENT="word1 word2">
        <META NAME="keywords" CONTENT="word3 word4">

Other examples

A very terse HTML summarizer could be specified with a table that only puts emphasized words into the keywords attribute:

HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <A>             keywords
    <B>             keywords
    <EM>            keywords
    <H1>            keywords
    <H2>            keywords
    <H3>            keywords
    <I>             keywords
    <META:CONTENT>  $NAME
    <STRONG>        keywords
    <TITLE>         title,keywords
    <TT>            keywords

Conversely, a full-text summarizer can be easily specified with only:

HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <HTML>          full-text
    <TITLE>         title,parent

Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps

The Harvest Gatherer's actions are defined by a set of configuration and utility files, and a corresponding set of executable programs referenced by some of the configuration files.

If you want to customize a Gatherer, you should create bin and lib subdirectories in the directory where you are running the Gatherer, and then copy $HARVEST_HOME/lib/gatherer/*.cf and $HARVEST_HOME/lib/gatherer/magic into your lib directory. Then add to your Gatherer configuration file:

        Lib-Directory:         lib
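
Concretely, assuming $HARVEST_HOME is set in your environment (the RunGatherer script shown later sets it, for example), these setup steps might look like this from the Gatherer's directory:

        % mkdir bin lib
        % cp $HARVEST_HOME/lib/gatherer/*.cf lib
        % cp $HARVEST_HOME/lib/gatherer/magic lib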

The details about what each of these files does are described below. The basic contents of a typical Gatherer's directory are as follows (note: some of the file names below can be changed by setting variables in the Gatherer configuration file, as described in Section Setting variables in the Gatherer configuration file):

        RunGatherd*    bin/           GathName.cf    log.errors     tmp/
        RunGatherer*   data/          lib/           log.gatherer

        bin:
        MyNewType.sum*

        data:
        All-Templates.gz    INFO.soif    PRODUCTION.gdbm    gatherd.log
        INDEX.gdbm          MD5.gdbm     gatherd.cf

        lib:
        bycontent.cf   byurl.cf       quick-sum.cf
        byname.cf      magic          stoplist.cf

        tmp:

The RunGatherd and RunGatherer scripts are used to export the Gatherer's database after a machine reboot and to run the Gatherer, respectively. The log.errors and log.gatherer files contain error messages and the output of the Essence typing step, respectively (Essence is described above, in Section Extracting data for indexing: The Essence summarizing subsystem). The GathName.cf file is the Gatherer's configuration file.

The bin directory contains any summarizers, along with any other programs needed by the summarizers. If you customize the Gatherer by adding a summarizer, you place those programs in this bin directory; MyNewType.sum above is an example.

The data directory contains the Gatherer's database which gatherd exports. The Gatherer's database consists of the All-Templates.gz, INDEX.gdbm, INFO.soif, MD5.gdbm and PRODUCTION.gdbm files. The gatherd.cf file is used to support access control as described in Section Controlling access to the Gatherer's database. The gatherd.log file is where the gatherd program logs its information.

The lib directory contains the configuration files used by the Gatherer's subsystems, namely Essence. These files are described briefly in the following table:

        bycontent.cf    Content parsing heuristics for type recognition step
        byname.cf       File naming heuristics for type recognition step
        byurl.cf        URL naming heuristics for type recognition step
        magic           UNIX ``file'' command specifications (matched against
                        bycontent.cf strings)
        quick-sum.cf    Attribute extraction rules for summarizing step
        stoplist.cf     File types to reject during candidate selection

Customizing the type recognition step

Essence recognizes types in three ways (in order of precedence): by URL naming heuristics, by file naming heuristics, and by locating identifying data within a file using the UNIX file command.

To modify the type recognition step, edit lib/byname.cf to add file naming heuristics, or lib/byurl.cf to add URL naming heuristics, or lib/bycontent.cf to add by-content heuristics. The by-content heuristics match the output of the UNIX file command, so you may also need to edit the lib/magic file. See Section Example 3 and Example 4 for detailed examples on how to customize the type recognition step.

Customizing the candidate selection step

The lib/stoplist.cf configuration file contains a list of types that are rejected by Essence. You can add or delete types from lib/stoplist.cf to control the candidate selection step.

To direct Essence to index only certain types, you can list the types to index in lib/allowlist.cf. Then, supply Essence with the --allowlist flag.

The file and URL naming heuristics used by the type recognition step (described in Section Customizing the type recognition step) are particularly useful for candidate selection when gathering remote data. They allow the Gatherer to avoid retrieving files that you don't want to index (in contrast, recognizing types by locating identifying data within a file requires that the file be retrieved first). This approach can save quite a bit of network traffic, particularly when used in combination with enumerated RootNode URLs. For example, many sites provide each of their files in both a compressed and uncompressed form. By building a lib/allowlist.cf containing only the Compressed types, you can avoid retrieving the uncompressed versions of the files.
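
For example, assuming the allow list lives in the Gatherer's lib directory, the --allowlist flag can be passed through the Essence-Options variable in the Gatherer configuration file (see Section Setting variables in the Gatherer configuration file):

        Essence-Options:    --allowlist lib/allowlist.cf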

Customizing the presentation unnesting step

Some types are declared as ``nested'' types. Essence treats these differently than other types, by running a presentation unnesting algorithm or ``Exploder'' on the data rather than a Summarizer. At present Essence can handle files nested in the following formats:

  1. binhex
  2. uuencode
  3. shell archive (``shar'')
  4. tape archive (``tar'')
  5. bzip2 compressed (``bzip2'')
  6. compressed
  7. GNU compressed (``gzip'')
  8. zip compressed archive

To customize the presentation unnesting step you can modify the Essence source file src/gatherer/essence/unnest.c. This file lists the available presentation encodings, and also specifies the unnesting algorithm. Typically, an external program is used to unravel a file into one or more component files (e.g. bzip2, gunzip, uudecode, and tar).

An Exploder may also be used to explode a file into a stream of SOIF objects. An Exploder program takes a URL as its first command-line argument and a file containing the data to use as its second, and then generates one or more SOIF objects as output. For your convenience, the Exploder type is already defined as a nested type. To save some time, you can use this type and its corresponding Exploder.unnest program rather than modifying the Essence code.

See Section Example 2 for a detailed example on writing an Exploder. The unnest.c file also contains further information on defining the unnesting algorithms.

Customizing the summarizing step

Essence supports two mechanisms for defining the type-specific extraction algorithms (called Summarizers) that generate content summaries: a UNIX program that takes as its only command line argument the filename of the data to summarize, and line-based regular expressions specified in lib/quick-sum.cf. See Section Example 4 for detailed examples on how to define both types of Summarizers.

The UNIX Summarizers are named using the convention TypeName.sum (e.g., PostScript.sum). These Summarizers output their content summary in a SOIF attribute-value list (see Section The Summary Object Interchange Format (SOIF)). You can use the wrapit command to wrap raw output into the SOIF format (i.e., to provide byte-count delimiters on the individual attribute-value pairs).

There is a summarizer called FullText.sum that you can use to perform full text indexing of selected file types, by simply setting up the lib/bycontent.cf and lib/byname.cf configuration files to recognize the desired file types as FullText (i.e., using ``FullText'' in column 1 next to the matching regular expression).
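
As an illustration, a minimal UNIX summarizer might look like the sketch below. It is not part of the distribution; the attribute name is arbitrary, and the byte-count delimiter convention follows the SOIF examples shown elsewhere in this manual (a real summarizer would normally extract more, and might use wrapit instead of computing the counts itself):

        #!/bin/sh
        # MyNewType.sum (sketch): emit the file's first line as a
        # "title" attribute in SOIF attribute-value form.
        title=`head -1 "$1"`
        len=`echo "$title" | wc -c | tr -d ' '`   # byte count of value
        printf 'title{%s}:\t%s\n' "$len" "$title"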

4.6 Post-Summarizing: Rule-based tuning of object summaries

It is possible to ``fine-tune'' the summary information generated by the Essence summarizers. A typical application of this would be to change the Time-to-Live attribute based on some knowledge about the objects. So an administrator could use the post-summarizing feature to give quickly-changing objects a lower TTL, and very stable documents a higher TTL.

Objects are selected for post-summarizing if they meet a specified condition. A condition consists of three parts: An attribute name, an operation, and some string data. For example:

        city == 'New York'

In this case we are checking if the city attribute is equal to the string `New York'. For exact string matching, the string data must be enclosed in single quotes. Regular expressions are also supported:

        city ~ /New York/

Negative operators are also supported:

        city != 'New York'
        city !~ /New York/

Conditions can be joined with `&&' (logical and) or `||' (logical or) operators:

        city == 'New York' && state != 'NY';

When all conditions are met for an object, some number of instructions are executed on it. There are four types of instructions which can be specified:

  1. Set an attribute exactly to some specific string. Example:
            time-to-live = "86400"
          
    
  2. Filter an attribute through some program. The attribute value is given as input to the filter. The output of the filter becomes the new attribute value. Example:
            keywords | tr A-Z a-z
          
    
  3. Filter multiple attributes through some program. In this case the filter must read and write attributes in the SOIF format. Example:
            address,city,state,zip ! cleanup-address.pl
          
    
  4. A special case instruction is to delete an object. To do this, simply write:
            delete()
          
    

The Rules file

The conditions and instructions are combined together in a ``rules'' file. The format of this file is somewhat similar to a Makefile; conditions begin in the first column and instructions are indented by a tab-stop.

Example:

        type == 'HTML'
                partial-text | cleanup-html-text.pl

        URL ~ /users/
                time-to-live = "86400"
                partial-text ! extract-owner.sh

        type == 'SOIFStream'
                delete()

This rules file is specified in the gatherer.cf file with the Post-Summarizing tag, e.g.:

        Post-Summarizing: lib/myrules

Rewriting URLs

Until version 1.4 it was not possible to rewrite the URL-part of an object summary. It is now possible, but only by using the ``pipe'' instruction. This may be useful for people wanting to run a Gatherer on file:// URLs, but have them appear as http:// URLs. This can be done with a post-summarizing rule such as:

        url ~ 'file://localhost/web/htdocs/'
                url | fix-url.pl

And the 'fix-url.pl' script might look like:

        #!/usr/local/bin/perl -p
        s'file://localhost/web/htdocs/'http://www.my.domain/';

4.7 Gatherer administration

Setting variables in the Gatherer configuration file

In addition to customizing the steps described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information. Section Basic setup shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.

Each variable name starts in the first column, ends with a colon, then is followed by the value. The following table shows the supported variables:

        Access-Delay:           Default delay between URL accesses.
        Data-Directory:         Directory where GDBM database is written.
        Debug-Options:          Debugging options passed to child programs.
        Errorlog-File:          File for logging errors.
        Essence-Options:        Any extra options to pass to Essence.
        FTP-Auth:               Username/password for protected FTP documents.
        Gatherd-Inetd:          Denotes that gatherd is run from inetd.
        Gatherer-Host:          Full hostname where the Gatherer is run.
        Gatherer-Name:          A unique name for the Gatherer.
        Gatherer-Options:       Extra options for the Gatherer.
        Gatherer-Port:          Port number for gatherd.
        Gatherer-Version:       Version string for the Gatherer.
        HTTP-Basic-Auth:        Username/password for protected HTTP documents.
        HTTP-Proxy:             host:port of your HTTP proxy.
        Keep-Cache:             ``yes'' to not remove local disk cache.
        Lib-Directory:          Directory where configuration files live.
        Local-Mapping:          Mapping information for local gathering.
        Log-File:               File for logging progress.
        Post-Summarizing:       A rules-file for post-summarizing.
        Refresh-Rate:           Object refresh-rate in seconds, default 1 week.
        Time-To-Live:           Object time-to-live in seconds, default 1 month.
        Top-Directory:          Top-level directory for the Gatherer.
        Working-Directory:      Directory for tmp files and local disk cache.
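
For illustration, the first part of a Gatherer configuration file might set a few of these variables as follows (the values shown are examples only):

        Keep-Cache:       yes
        HTTP-Proxy:       proxy.your.domain:3128
        Refresh-Rate:     604800
        Time-To-Live:     2592000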

Notes:

The Essence options are:

Option                  Meaning
--------------------------------------------------------------------
--allowlist filename    File with list of types to allow
--fake-md5s             Generates MD5s for SOIF objects from a .unnest program
--fast-summarizing      Trade speed for some consistency.  Use only when
                        an external summarizer is known to generate clean,
                        unique attributes.
--full-text             Use entire file instead of summarizing.  Alternatively,
                        you can perform full text indexing of individual file
                        types by using the FullText.sum summarizer.
--max-deletions n       Number of GDBM deletions before reorganization
--minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
--no-access             Do not read contents of objects
--no-keywords           Do not automatically generate keywords
--stoplist filename     File with list of types to remove
--type-only             Only type data; do not summarize objects

A particular note about full text summarizing: Using the Essence --full-text option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results (e.g., it will directly include the PostScript for a document rather than first passing the data through a PostScript to text extractor, providing few searchable terms and large SOIF objects). Using the individual file type summarizing mechanism described in Section Customizing the summarizing step will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

Local file system gathering for reduced CPU load

Although the Gatherer's work load is specified using URLs, often the files being gathered are located on a local file system. In this case it is much more efficient to gather directly from the local file system than via FTP/Gopher/HTTP/News, primarily because of all the UNIX forking required to gather information via these network processes. For example, our measurements indicate that gathering via FTP causes 4-7 times more CPU load than gathering directly from the local file system. For large collections (e.g., archive sites containing many thousands of files), the CPU savings can be considerable.

Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to translate URLs to local file system names, using the Local-Mapping Gatherer configuration file variable (see Section Setting variables in the Gatherer configuration file). The syntax is:

        Local-Mapping: URL_prefix local_path_prefix

This causes all URLs starting with URL_prefix to be translated to files starting with the prefix local_path_prefix while gathering, but to be left as URLs in the results of queries (so the objects can be retrieved as usual). Note that no regular expressions are supported here. As an example, the specification

        Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/
        Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/

would cause the URL http://harvest.cs.colorado.edu/~hardy/Home.html to be translated to the local file name /homes/hardy/public_html/Home.html, while the URL ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z would be translated to the local file name /cs/ftp/techreports/schwartz/Harvest.Conf.ps.Z.

Local gathering will work over NFS file systems. A local mapping will fail if: the local file cannot be opened for reading; or the local file is not a regular file; or the local file has execute bits set. So, for directories, symbolic links and CGI scripts, the server is always contacted rather than the local file system. Lastly, the Gatherer does not perform any URL syntax translations for local mappings. If your URL has characters that should be escaped (as in RFC1738), then the local mapping will fail. Starting with version 1.4 patchlevel 2 Essence will print [L] after URLs which were successfully accessed locally.

Note that if your network is highly congested, it may actually be faster to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very inefficient in highly congested situations. Even better would be to run local Gatherers on the hosts where the disks reside, and access them directly via the local file system.

Gathering from password-protected servers

You can gather password-protected documents from HTTP and FTP servers. In both cases, you can specify a username and password as a part of the URL. The format is as follows:

         ftp://user:password@host:port/url-path
        http://user:password@host:port/url-path

With this format, the ``user:password'' part is kept as a part of the URL string all throughout Harvest. This may enable anyone who uses your Broker(s) to access password-protected documents.

You can keep the username and password information ``hidden'' by specifying the authentication information in the Gatherer configuration file. For HTTP, the format is as follows:

        HTTP-Basic-Auth: realm username password

where realm is the same as the AuthName parameter given in an Apache httpd httpd.conf or .htaccess file. In other httpd server configurations, the realm value is sometimes called ServerId.

For FTP, the format in the gatherer.cf file is

        FTP-Auth: hostname[:port] username password

Controlling access to the Gatherer's database

You can use the gatherd.cf file (placed in the Data-Directory of a Gatherer) to control access to the Gatherer's database. A line that begins with Allow is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word all is used, then all hosts are matched. Deny is the opposite of Allow. The following example will only allow hosts in the cs.colorado.edu or usc.edu domain access the Gatherer's database:

        Allow  cs.colorado.edu usc.edu
        Deny   all

Periodic gathering and realtime updates

The Gatherer program does not automatically do any periodic updates -- when you run it, it processes the specified URLs, starts up a gatherd daemon (if one isn't already running), and then exits. If you want to update the data periodically (e.g., to capture new files as they are added to an FTP archive), you need to use the UNIX cron command to run the Gatherer program at some regular interval.

To set up periodic gathering via cron, use the RunGatherer command that RunHarvest will create. An example RunGatherer script follows:

        #!/bin/sh
        #
        #  RunGatherer - Runs the ATT 800 Gatherer (from cron)
        #
        HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
        export PATH
        NNTPSERVER=localhost; export NNTPSERVER
        cd /usr/local/harvest/gatherers/att800
        exec Gatherer "att800.cf"
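
For example, a crontab entry to run this Gatherer weekly (Sundays at 3am, say) might look like the following; adjust the path to wherever your RunGatherer script resides:

        0 3 * * 0  /usr/local/harvest/gatherers/att800/RunGatherer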

You should run the RunGatherd command from your system startup (e.g. /etc/rc.local) file, so the Gatherer's database is exported each time the machine reboots. An example RunGatherd script follows:

        #!/bin/sh
        #
        #  RunGatherd - starts up the gatherd process (from /etc/rc.local)
        #
        HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        PATH=${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/bin:$PATH; export PATH
        exec gatherd -d /usr/local/harvest/gatherers/att800/data 8500

The local disk cache

The Gatherer maintains a local disk cache of files it gathers, to reduce network traffic when restarting aborted gathering attempts. However, since the remote server must still be contacted whenever the Gatherer runs, please do not set your cron job to run the Gatherer too frequently. A typical interval might be weekly or monthly, depending on how congested your network is and how important it is to have the most current data.

By default, the Gatherer's local disk cache is deleted after each successful completion. To save the local disk cache between Gatherer sessions, define Keep-Cache: yes in your Gatherer configuration file (Section Setting variables in the Gatherer configuration file).

If you want your Broker's index to reflect new data, then you must run the Gatherer and run a Broker collection. By default, a Broker will perform collections once a day. If you want the Broker to collect data as soon as it's gathered, then you will need to coordinate the timing of the completion of the Gatherer and the Broker collections.

If you run your Gatherer frequently and you use Keep-Cache: yes in your Gatherer configuration file, then the Gatherer's local disk cache may interfere with retrieving updates. By default, objects in the local disk cache expire after 7 days. You can expire objects more quickly by setting the $GATHERER_CACHE_TTL environment variable to the desired Time-To-Live (TTL) in seconds before you run the Gatherer, or you can change RunGatherer to remove the Gatherer's tmp directory after each Gatherer run. For example, to expire objects in the local disk cache after one day:

        % setenv GATHERER_CACHE_TTL 86400       # one day
        % ./RunGatherer

The Gatherer's local disk cache size defaults to 32 MB, but you can change this value by setting the $HARVEST_MAX_LOCAL_CACHE environment variable to the desired number of megabytes before you run the Gatherer. For example, to set a maximum cache size of 10 MB:

        % setenv HARVEST_MAX_LOCAL_CACHE 10       # 10 MB
        % ./RunGatherer

If you have access to the software that creates the files that you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify this software to schedule realtime Gatherer updates whenever a file is created or updated. For example, if all users update the files being indexed using a particular program, this program could be modified to run the Gatherer upon completion of the user's update.

Note that, when used in conjunction with cron, the Gatherer provides a powerful data ``mirroring'' facility. You can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes.

Incorporating manually generated information into a Gatherer

You may want to inspect the quality of the automatically-generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section Customizing the summarizing step). Sometimes, however, it makes sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add Title attributes to the content summaries for a set of PostScript documents (since it's difficult to parse them out of PostScript automatically).

Harvest provides some programs that automatically clean up a Gatherer's database. The rmbinary program removes any binary data from the templates. The cleandb program does some simple validation of SOIF objects, and when given the -truncate flag it will truncate the Keywords data field to 8 kilobytes. To help in manually managing the Gatherer's databases, the gdbmutil GDBM database management tool is provided in $HARVEST_HOME/lib/gatherer.

In a future release of Harvest we will provide a forms-based mechanism to make it easy to add manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the mktemplate, template2db, mergedb, and mkindex programs. First, create a file (called, say, annotations) in the following format:

        @FILE { url1
        Attribute-Name-1:        DATA
        Attribute-Name-2:        DATA
        ...
        Attribute-Name-n:        DATA
        }

        @FILE { url2
        Attribute-Name-1:        DATA
        Attribute-Name-2:        DATA
        ...
        Attribute-Name-n:        DATA
        }

        ...

Note that each attribute name must begin in column 0 and be followed by a colon and a single tab, and that the DATA must fit on a single line.
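
For instance, an annotations file that adds a Title attribute to a PostScript document might look like this (the URL and title are hypothetical):

        @FILE { http://www.example.org/papers/report1.ps
        Title:  An Example Technical Report
        }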

Next, run the mktemplate and template2db programs to generate SOIF and then GDBM versions of these data (you can list several annotation files and generate a single GDBM database from them):

        % set path = ($HARVEST_HOME/lib/gatherer $path)
        % mktemplate annotations [annotations2 ...] | template2db annotations.gdbm

Finally, you run mergedb to incorporate the annotations into the automatically generated data, and mkindex to generate an index for it. The usage line for mergedb is:

        mergedb production automatic manual [manual ...]

The idea is that production is the final GDBM database that the Gatherer will serve. This is a new database that will be generated from the other databases on the command line. automatic is the GDBM database that a Gatherer automatically generated in a previous run (e.g., WORKING.gdbm or a previous PRODUCTION.gdbm). manual and so on are the GDBM databases that you manually created. When mergedb runs, it builds the production database by first copying the templates from the manual databases, and then merging in the attributes from the automatic database. In case of a conflict (the same attribute with different values in the manual and automatic databases), the manual values override the automatic values.

By keeping the automatically and manually generated data in separate databases, you avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script that re-merges the manual annotations with the automatically gathered data after each gathering (a sketch appears after the worked example below).

An example use of mergedb is:

        % mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
        % mv PRODUCTION.new PRODUCTION.gdbm
        % mkindex

If the manual database looked like this:

        @FILE { url1
        my-manual-attribute:  this is a neat attribute
        }

and the automatic database looked like this:

        @FILE { url1
        keywords:   boulder colorado
        file-size:  1034
        md5:        c3d79dc037efd538ce50464089af2fb6
        }

then the resulting production database will look like this:

        @FILE { url1
        my-manual-attribute:  this is a neat attribute
        keywords:   boulder colorado
        file-size:  1034
        md5:        c3d79dc037efd538ce50464089af2fb6
        }
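
Putting the pieces together, here is a minimal sketch of a post-gathering re-merge script; the paths, database file names, and the assumption that the gathered database lives in the Gatherer's data directory are all illustrative:

        #!/bin/sh
        # remerge.sh - hypothetical helper run after each gathering to fold the
        # manual annotations back into the freshly gathered database.
        # Paths, file names, and the data directory layout are assumptions.
        HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        PATH=$HARVEST_HOME/lib/gatherer:$PATH; export PATH
        cd $HARVEST_HOME/gatherers/sample/data || exit 1
        mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
        mv PRODUCTION.new PRODUCTION.gdbm
        mkindex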

4.8 Troubleshooting

Debugging

Extra information from specific programs and library routines can be logged by setting debugging flags. A debugging flag has the form -Dsection,level, where section is an integer in the range 1-255 and level is an integer in the range 1-9. Debugging flags can be given on a command line, with the Debug-Options: tag in a Gatherer configuration file, or by setting the environment variable $HARVEST_DEBUG.

Examples:

        Debug-Options: -D68,5 -D44,1
        % httpenum -D20,1 -D21,1 -D42,1 http://harvest.cs.colorado.edu/
        % setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1'

Debugging sections and levels have been assigned to the following parts of the code:

section  20, level 1, 5, 9          Common liburl URL processing
section  21, level 1, 5, 9          Common liburl HTTP routines
section  22, level 1, 5             Common liburl disk cache routines
section  23, level 1                Common liburl FTP routines
section  24, level 1                Common liburl Gopher routines
section  25, level 1                urlget - standalone liburl program.
section  26, level 1                ftpget - standalone liburl program.
section  40, level 1, 5, 9          Gatherer URL enumeration
section  41, level 1                Gatherer enumeration URL verification
section  42, level 1, 5, 9          Gatherer enumeration for HTTP
section  43, level 1, 5, 9          Gatherer enumeration for Gopher
section  44, level 1, 5             Gatherer enumeration filter routines
section  45, level 1                Gatherer enumeration for FTP
section  46, level 1                Gatherer enumeration for file:// URLs
section  48, level 1, 5             Gatherer enumeration robots.txt stuff
section  60, level 1                Gatherer essence data object processing
section  61, level 1                Gatherer essence database routines
section  62, level 1                Gatherer essence main
section  63, level 1                Gatherer essence type recognition
section  64, level 1                Gatherer essence object summarizing
section  65, level 1                Gatherer essence object unnesting
section  66, level 1, 2, 5          Gatherer essence post-summarizing
section  67, level 1                Gatherer essence object-ID code
section  69, level 1, 5, 9          Common SOIF template processing
section  70, level 1, 5, 9          Broker registry
section  71, level 1                Broker collection routines
section  72, level 1                Broker SOIF parsing routines
section  73, level 1, 5, 9          Broker registry hash tables
section  74, level 1                Broker storage manager routines
section  75, level 1, 5             Broker query manager routines
section  75, level 4                Broker query_list debugging
section  76, level 1                Broker event management routines
section  77, level 1                Broker main
section  78, level 9                Broker select(2) loop
section  79, level 1, 5, 9          Broker gatherer-id management
section  80, level 1                Common utilities memory management
section  81, level 1                Common utilities buffer routines
section  82, level 1                Common utilities system(3) routines
section  83, level 1                Common utilities pathname routines
section  84, level 1                Common utilities hostname processing
section  85, level 1                Common utilities string processing
section  86, level 1                Common utilities DNS host cache
section 101, level 1                Broker PLWeb indexing engine
section 102, level 1, 2, 5          Broker Glimpse indexing engine
section 103, level 1                Broker Swish indexing engine

Symptom

The Gatherer doesn't pick up all the objects pointed to by some of my RootNodes.

Solution

The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See Section RootNode specifications for details on how to override these limits.

Symptom

Local-Mapping did not work for me - it retrieved the objects via the usual remote access protocols.

Solution

A local mapping will fail if the local object is a directory, a symbolic link, or a CGI script; for these, the HTTP server is always contacted. We do not perform URL translation for local mappings, so if your URLs contain special characters that must be escaped, the local mapping will also fail. Add the debug option -D20,1 to see how local mappings are being applied.

Symptom

Using the --full-text option I see a lot of raw data in the content summaries, with few keywords I can search.

Solution

At present --full-text simply includes the full data content in the SOIF summaries. Using the individual file type summarizing mechanism described in Section Customizing the summarizing step will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

Symptom

No indexing terms are being generated in the SOIF summary for the META tags in my HTML documents.

Solution

This probably indicates that your HTML is not syntactically well-formed, and hence the SGML-based HTML summarizer is not able to recognize it. See Section Summarizing SGML data for details and debugging options.

Symptom

Gathered data are not being updated.

Solution

The Gatherer does not automatically do periodic updates. See Section Periodic gathering and realtime updates for details.

Symptom

The Gatherer puts slightly different URLs in the SOIF summaries than I specified in the Gatherer configuration file.

Solution

This happens because the Gatherer attempts to put URLs into a canonical format; for example, it removes default port numbers and makes similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes the standard stoplist.cf types, which include HTTP-Query (i.e., cgi-bin URLs).

Symptom

There are no Last-Modification-Time or MD5 attributes in my gathered SOIF data, so the Broker can't do duplicate elimination.

Solution

If you gather remote, manually created information, it is pulled into Harvest by ``exploders'' that translate the remote format into SOIF, so there is no direct way to fill in the Last-Modification-Time or MD5 information for each record. This also means that a single update to the remote records makes all of the records look updated, which results in more network load for Brokers that collect from this Gatherer. As a workaround, you can compute MD5s for all objects and store them as part of each record; then, when you run the exploder, generate new timestamps only for the records whose MD5s changed, giving you real last-modification times.
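
Here is a minimal sketch of that bookkeeping; the directory layout, file names, and the use of md5sum are assumptions, so adapt them to however your exploder stores its records:

        #!/bin/sh
        # For each exploded record file, compare its current MD5 against the one
        # stored last time; only when the MD5 changed do we record a new
        # last-modification timestamp (and remember the new checksum).
        for f in records/*.soif; do
            new=`md5sum "$f" | awk '{print $1}'`
            old=`cat "$f.md5" 2>/dev/null`
            if [ "$new" != "$old" ]; then
                echo "$new" > "$f.md5"      # remember the new checksum
                date +%s > "$f.lmt"         # this record changed; timestamp it now
            fi
        done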

Symptom

The Gatherer substitutes a ``%7e'' for a ``~'' in all the user directory URLs.

Solution

The Gatherer conforms to RFC1738, which says that a tilde inside a URL should be encoded as ``%7e'', because it is considered an ``unsafe'' character.

Symptom

When I search using keywords I know are in a document I have indexed with Harvest, the document isn't found.

Solution

Harvest uses a content extraction subsystem called Essence that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps. Or, you can tell Essence to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section Setting variables in the Gatherer configuration file.
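
For example, assuming your Gatherer configuration file supports the Essence-Options: tag (see Section Setting variables in the Gatherer configuration file), full text summarizing could be requested with a line like:

        # assumed configuration line enabling Essence's --full-text option
        Essence-Options:  --full-text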

Symptom

I'm running Harvest on HP-UX, but the essence process in the Gatherer takes too much memory.

Solution

The regular expression library included with Harvest leaks memory on HP-UX, so you need to use the POSIX regular expression library supplied with HP-UX instead. Change the Makefile in src/gatherer/essence to read:

        REGEX_DEFINE    = -DUSE_POSIX_REGEX
        REGEX_INCLUDE   =
        REGEX_OBJ       =
        REGEX_TYPE      = posix

Symptom

I built the configuration files to customize how Essence recognizes types and extracts content, but it uses the standard typing and extraction mechanisms anyway.

Solution

Verify that Lib-Directory, which is defined in your Gatherer configuration file, points to the lib/ directory in which you put your customized configuration files.
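
For example, the configuration line might look like this (the path shown is only an illustration):

        Lib-Directory:    /usr/local/harvest/gatherers/sample/lib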

Symptom

I am having problems resolving host names on SunOS.

Solution

In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either your system is not configured to use the Domain Name Service (DNS), or your NIS servers are not configured to pass unresolved hostnames on to DNS.

To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup command.
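
For example, a quick check using the host name from the debugging examples above might look like this:

        % nslookup harvest.cs.colorado.edu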

Some sites may use Sun Microsystems' Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (whose names you can determine with the ypwhich command) must be configured to query DNS servers for hostnames they do not know about. See the -b option of the ypxfr command.

Symptom

I cannot get the Gatherer to work across our firewall gateway.

Solution

Harvest only supports retrieving HTTP objects through a proxy. It is not yet possible to request Gopher and FTP objects through a firewall. For these objects, you may need to run Harvest internally (behind the firewall) or on the firewall host itself.

If you see the ``Host is unreachable'' message, the likely problems are that there is no route from your host to the destination network (for example, because your firewall does not allow direct connections), or that the destination network or an intermediate gateway is down.

If you see the ``Connection refused'' message, the likely problem is that you are trying to connect to an unused port on the destination machine; in other words, no program is listening for connections on that port.

The Harvest Gatherer is essentially a WWW client, so you should expect it to behave the same as any Web browser: if a browser on the same host cannot reach a site through your firewall, neither can the Gatherer.


Next Previous Contents