next up previous contents index
Next: 5 The Broker Up: 4 The Gatherer Previous: 4.7.7 Incorporating manually generated

4.8 Troubleshooting



Version 1.3 has an improved debugging facility. Extra information from specific programs and library routines can be logged by setting debugging flags. A debugging flag has the form -Dsection,level, where section is an integer in the range 1--255 and level is an integer in the range 1--9. Debugging flags can be given on a command line, with the Debug-Options: tag in a Gatherer configuration file, or by setting the environment variable $HARVEST_DEBUG. Examples:

        Debug-Options: -D68,5 -D44,1
        % httpenum -D20,1 -D21,1 -D42,1
        % setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1'

Debugging sections and levels have been assigned to the following sections of the code:

section  20, level 1                Common liburl URL processing
section  21, level 1, 5             Common liburl HTTP routines
section  22, level 1                Common liburl disk cache routines
section  23, level 1                Common liburl FTP routines
section  24, level 1                Common liburl Gopher routines
section  25, level 1                urlget - standalone liburl program.
section  26, level 1                ftpget - standalone liburl program.
section  40, level 1, 5, 9          Gatherer URL enumeration
section  41, level 1                Gatherer enumeration URL verification
section  42, level 1, 5, 9          Gatherer enumeration for HTTP
section  43, level 1                Gatherer enumeration for Gopher
section  44, level 1, 5             Gatherer enumeration filter routines
section  45, level 1                Gatherer enumeration for FTP
section  46, level 1                Gatherer enumeration for file:// URLs
section  48, level 1, 5             Gatherer enumeration robots.txt stuff
section  60, level 1                Gatherer essence data object processing
section  61, level 1                Gatherer essence database routines
section  62, level 1                Gatherer essence main
section  63, level 1                Gatherer essence type recognition
section  64, level 1                Gatherer essence object summarizing
section  65, level 1                Gatherer essence object unnesting
section  66, level 1                Gatherer essence post-summarizing
section  69, level 1, 5, 9          Common SOIF template processing
section  80, level 1                Common utilities memory management
section  81, level 1                Common utilities buffer routines
section  82, level 1                Common utilities system(3) routines
section  83, level 1                Common utilities pathname routines
section  84, level 1                Common utilities hostname processing
section  85, level 1                Common utilities string processing
section  86, level 1                Common utilities DNS host cache
section 102, level 1                Broker Glimpse indexing engine

The Gatherer doesn't pick up all the objects pointed to by some of my RootNodes.

The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See section 4.3 for details on how to override these limits.

Local-Mapping did not work for me---it retrieved the objects via the usual remote access protocols.

A local mapping will fail if:

  1. the URL refers to a directory;
  2. the local file is a symbolic link;
  3. the URL refers to a CGI script; or
  4. the URL contains special characters that must be escaped, since we do not perform URL translation for local mappings.

In the first three cases the HTTP server is contacted as usual. Add the debug option -D20,1 to see how local mappings are being applied.
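As a hedged illustration (the hostname and paths are placeholders), a Gatherer configuration that maps an HTTP tree onto a local directory while logging how the mapping is applied might contain:

```
# Enable liburl URL-processing debugging (section 20) to see
# why a given URL is or is not served from the local mapping.
Debug-Options: -D20,1
Local-Mapping: http://www.example.com/docs/ /usr/local/docs/
```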

Using the --full-text option I see a lot of raw data in the content summaries, with few keywords I can search.

At present --full-text simply includes the full data content in the SOIF summaries. Using the individual file type summarizing mechanism described in Section 4.5.4 will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

No indexing terms are being generated in the SOIF summary for the META tags in my HTML documents.

This probably indicates that your HTML is not syntactically well-formed, so the SGML-based HTML summarizer is not able to recognize it. See Section 4.5.2 for details and debugging options.

Gathered data are not being updated.

The Gatherer does not automatically do periodic updates. See Section 4.7.5 for details.

The Gatherer puts slightly different URLs in the SOIF summaries than I specified in the Gatherer configuration file.

This happens because the Gatherer attempts to put URLs into a canonical form. It does this by removing default port numbers, stripping ``#'' bookmark fragments, and making similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes certain standard types, including HTTP-Query (the cgi-bin stuff).


There are no Last-Modification-Time or MD5 attributes in my gathered SOIF data, so the Broker can't do duplicate elimination.

If you gather remote, manually created information (as in our PC Software Broker), it is pulled into Harvest using ``exploders'' that translate from the remote format into SOIF. The exploders therefore have no direct way to fill in the Last-Modification-Time or MD5 information per record. Note also that this means one update to the remote records makes all records look updated, which results in more network load for Brokers that collect this Gatherer's data. As a solution, you can compute MD5s for all objects and store them as part of each record. Then, when you run the exploder, you generate new timestamps only for the records whose MD5s changed---giving you real last-modification times.
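A minimal sketch of that approach, assuming the md5sum utility and using hypothetical file names (the real exploder and record format will differ):

```shell
# Hedged sketch, not the actual exploder: remember an MD5 per record so
# that a timestamp is refreshed only when the record's content changed.
record=record.soif
printf 'some record content\n' > "$record"

new_md5=$(md5sum "$record" | awk '{print $1}')
old_md5=$(cat "$record.md5" 2>/dev/null)

if [ "$new_md5" != "$old_md5" ]; then
    echo "$new_md5" > "$record.md5"    # remember the new checksum
    date -u > "$record.lmt"            # record a real modification time
fi
```

Re-running this with unchanged content leaves the stored timestamp alone, so only genuinely modified records appear updated to collecting Brokers.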

The Gatherer substitutes a ``%7e'' for a ``~'' in all the user directory URLs.

Starting with Harvest Version 1.2, we changed the Gatherer to conform to RFC 1738 [3], which says that a tilde inside a URL should be encoded as ``%7e'' because it is considered an ``unsafe'' character.
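The encoding is simply the character's ASCII code in hexadecimal, prefixed with ``%''. A small sketch using the POSIX printf leading-quote numeric conversion shows why a tilde becomes %7e:

```shell
# printf converts an argument with a leading quote into the character's
# numeric value; "~" is ASCII 126 = 0x7e, hence the "%7e" escape.
printf '%%%02x\n' "'~"     # prints %7e
```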

When I search using keywords I know are in a document I have indexed with Harvest, the document isn't found.

Harvest uses a content extraction subsystem called Essence that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section 4.5.4. Or, you can tell Essence to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section 4.7.1.

I'm running Harvest on HP-UX, but the essence process in the Gatherer takes too much memory.

The supplied regular expression library has memory leaks on HP-UX, so you need to use the regular expression library supplied with HP-UX. Change the Makefile in src/gatherer/essence to read:

        REGEX_INCLUDE   =
        REGEX_OBJ       =
        REGEX_TYPE      = posix

I built the configuration files to customize how Essence types/content extracts data, but it uses the standard typing/extracting mechanisms anyway.

Verify that Lib-Directory points to the lib/ directory in which you placed your configuration files. Lib-Directory is defined in your Gatherer configuration file.

Essence dumps core when run (from the Gatherer).

Check whether you're running a non-stock version of the Domain Name System (DNS) resolver under SunOS. There is a version that fixes some security holes, but it is not compatible with the DNS resolver library with which we link essence for the binary Harvest distribution. If this is indeed the problem, you can either run the binary Harvest distribution on a stock SunOS machine, or rebuild Harvest from source (more specifically, rebuild essence, linking it with the non-stock DNS resolver library).


I am having problems resolving host names on SunOS.

In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either:

  1. the hostname you gave does not really exist; or
  2. your system is not configured to use the DNS.

To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup command.
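For reference, a minimal /etc/resolv.conf might look like the following (the domain name and server address are placeholders; 192.0.2.1 is from the documentation address range):

```
domain example.com
nameserver 192.0.2.1
```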

The Harvest executables for SunOS (4.1.3_U1) are statically linked with the stock resolver library from /usr/lib/libresolv.a. If you seem to have problems with the statically linked executables, please try to compile Harvest from the source code (see Section 3). This will make use of your local libraries, which may have been modified for your particular organization.

Some sites may use Sun Microsystems' Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (whose names you can determine with the ypwhich command) must be configured to query DNS servers for hostnames they do not know about. See the -b option of the ypxfr command.

We would welcome reports of Harvest successfully working with NIS. Please email us at


I cannot get the Gatherer to work across our firewall gateway.

Harvest only supports retrieving HTTP objects through a proxy. It is not yet possible to request Gopher and FTP objects through a firewall. For these objects, you may need to run Harvest internally (behind the firewall) or else on the firewall host itself.

If you see the ``Host is unreachable'' message, these are the likely problems:

  1. your connection to the Internet is temporarily down due to a circuit or routing failure; or
  2. you are behind a firewall.

If you see the ``Connection refused'' message, the likely problem is that you are trying to connect with an unused port on the destination machine. In other words, there is no program listening for connections on that port.
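A small hedged illustration of that case (port 4 on the local host is assumed to have nothing listening; /dev/tcp is a bash feature, and under other shells the open simply fails in the same way):

```shell
# A failed connect to a port with no listener is exactly the
# "Connection refused" situation: the host is reachable, but no
# program is accepting connections on that port.
if ! ( exec 3<>/dev/tcp/127.0.0.1/4 ) 2>/dev/null; then
    echo "connect failed: no program listening on 127.0.0.1:4"
fi
```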

The Harvest Gatherer is essentially a WWW client. You should expect it to work much the same as Mosaic, but without proxy support. We would be interested to hear about cases where the Gatherer is unable to contact a host, yet you can reach that host with other network programs (Mosaic, telnet, ping) without going through a proxy.



Duane Wessels
Wed Jan 31 23:46:21 PST 1996