Harvest User's Manual <author> Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee <date>2002-10-29 <abstract> Harvest User's Manual was edited by Kang-Jin Lee and covers Harvest version 1.8. It was originally written by Darren R. Hardy, Michael F. Schwartz and Duane Wessels for Harvest 1.4.pl2 on 1996-01-31. <toc> <sect>Introduction to Harvest <p> Harvest is an integrated set of tools to gather, extract, organize, and search information across the Internet. With modest effort, users can tailor Harvest to digest information in many different formats and offer custom search services on the Internet. A key goal of Harvest is to provide a flexible system that can be configured in various ways to create many types of indexes. Harvest also allows users to extract structured (attribute-value pair) information from many different information formats and build indexes that allow these attributes to be referenced during queries (e.g., searching for all documents with a certain regular expression in the title field). An important advantage of Harvest is that it allows users to build indexes using either manually constructed templates (for maximum control over index content), templates constructed from automatically extracted data (for easy coverage of large collections), or a hybrid of the two methods. Harvest is designed to make it easy to distribute the search system across a pool of networked machines to handle higher load. <sect1>Copyright <p> The core of Harvest is licensed under the <url url="../../COPYING" name="GPL">. Additional components distributed with Harvest are also under the GPL or a similar license. Glimpse, the current default fulltext indexer, has a different license. Here is a clarification of <url url="../glimpse-license-status" name="Glimpse's copyright status"> kindly posted by <url url="mailto:gvelez@tucson.com" name="Golda Velez"> to <url url="news:comp.infosystems.harvest" name="comp.infosystems.harvest">. 
<sect1>Online Harvest Resources <p> This manual is available at <htmlurl url="http://harvest.sourceforge.net/harvest/doc/html/manual.html" name="harvest.sourceforge.net/harvest/doc/html/manual.html">. More information about Harvest is available at <htmlurl url="http://harvest.sourceforge.net/" name="harvest.sourceforge.net">. <sect>Subsystem Overview <p> Harvest consists of several subsystems. The <em>Gatherer</em> subsystem collects indexing information (such as keywords, author names, and titles) from the resources available at <em>Provider</em> sites (such as FTP and HTTP servers). The <em>Broker</em> subsystem retrieves indexing information from one or more Gatherers, suppresses duplicate information, incrementally indexes the collected information, and provides a WWW query interface to it. <label id="img1"> <figure loc="tbp"> <eps file="../images/img1.eps" height="10cm"> <img src="../images/img1.png"> <caption>Harvest Software Components</caption> </figure> You should start using Harvest simply, by installing a single ``stock'' (i.e., not customized) Gatherer and Broker on one machine to index some of the FTP, World Wide Web, and NetNews data at your site. After you get the system working in this basic configuration, you can invest additional effort as warranted. First, as you scale up to index larger volumes of information, you can reduce the CPU and network load to index your data by distributing the gathering process. Second, you can customize how Harvest extracts, indexes, and searches your information, to better match the types of data you have and the ways your users would like to interact with the data. We discuss how to distribute the gathering process in the next subsection. 
We cover various forms of customization in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps"> and in several parts of Section <ref id="The Broker" name="The Broker">. <sect1>Distributing the Gathering and Brokering Processes <p> Harvest Gatherers and Brokers can be configured in various ways. Running a Gatherer remotely from a Provider site allows Harvest to interoperate with sites that are not running Harvest Gatherers, by using standard object retrieval protocols like FTP, Gopher, HTTP, and NNTP. However, as suggested by the bold lines in the left side of Figure <ref id="img2" name="2">, this arrangement results in excess server and network load. Running a Gatherer locally is much more efficient, as shown in the right side of Figure <ref id="img2" name="2">. Nonetheless, running a Gatherer remotely is still better than having many sites independently collect indexing information, since many Brokers or other search services can share the indexing information that the Gatherer collects. If you have a number of FTP/HTTP/Gopher/NNTP servers at your site, it is most efficient to run a Gatherer on each machine where these servers run. On the other hand, you can reduce installation effort by running a Gatherer at just one machine at your site and letting it retrieve data from across the network. <label id="img2"> <figure loc="tbp"> <eps file="../images/img2.eps" height="10cm"> <img src="../images/img2.png"> <caption>Harvest Configuration Options</caption> </figure> Figure <ref id="img2" name="2"> also illustrates that a Broker can collect information from many Gatherers (to build an index of widely distributed information). Brokers can also retrieve information from other Brokers, in effect cascading indexed views from one another. 
Brokers retrieve this information using the query interface, allowing them to filter or refine the information from one Broker to the next. <sect>Installing the Harvest Software <label id="Installing the Harvest Software"> <p> <sect1>Requirements for Harvest Servers <p> <sect2>Hardware <p> A good machine for running a typical Harvest server will have a reasonably fast processor, 1-2 GB of free disk, and 128 MB of RAM. A slower CPU will work, but it will slow down the Harvest server. More important than CPU speed, however, is memory size. Harvest uses a number of processes, some of which provide needed ``plumbing'' (e.g., <tt>search.cgi</tt>), and some of which improve performance (e.g., the <tt>glimpseserver</tt> process). If you do not have enough memory, your system will page too much, drastically reducing performance. The other factor affecting RAM usage is how much data you are trying to index in a Harvest Broker. The more data you index, the more disk I/O will be performed at query time, and the more RAM it will take to provide a reasonably sized disk buffer pool. The amount of disk you'll need depends on how much data you want to index in a single Broker. (It is possible to distribute your index over multiple Brokers if it gets too large for one disk.) A good rule of thumb is that you will need about 10% as much disk to hold the Gatherer and Broker databases as the total size of the data you want to index. The actual space needs will vary depending on the type of data you are indexing. For example, PostScript achieves a much higher indexing space reduction than HTML, because so much of the PostScript data (such as page positioning information) is discarded when building the index. <sect2>Platforms <p> To run a Harvest server, you need a UNIX-like operating system. <sect2>Software <p> To use Harvest, you need the following software packages: <itemize> <item>All Harvest servers require: Perl v5.0 or higher. 
<item>The Harvest Broker and Gatherer require: GNU <tt>gzip</tt> v1.2.4 or higher. <item>The Harvest Broker requires: an HTTP server. </itemize> To build Harvest from the source distribution, you may need to install one or more of the following software packages: <itemize> <item>Compiling Harvest requires: GNU <tt>gcc</tt> v2.5.8 or higher. <item>Compiling the Harvest Broker requires: <tt>flex</tt> v2.4.7 or higher and <tt>bison</tt> v1.22 or higher. </itemize> The sources for <tt>gcc</tt>, <tt>gzip</tt>, <tt>flex</tt>, and <tt>bison</tt> are available at the <url url="ftp://ftp.gnu.org/" name="GNU FTP server">. <sect1>Requirements for Harvest Users <p> Anyone with a web browser (e.g., Internet Explorer, Lynx, Mozilla, Netscape, or Opera) can access and use Harvest servers. <sect1>Retrieving and Installing the Harvest Software <p> <sect2>Distribution types <p> Currently we offer only a source distribution of Harvest. The <em>source distribution</em> contains all of the source code for the Harvest software. There are no <em>binary distributions</em> of Harvest. You can retrieve the Harvest source distributions from the Harvest download site <htmlurl url="http://prdownloads.sourceforge.net/harvest/" name="prdownloads.sourceforge.net/harvest/">. <sect2>Harvest components <p> Harvest components are in the <em>components</em> directory. To use a component, follow the instructions included in the desired component directory. <sect2>User-contributed software <p> There is a collection of unsupported user-contributed software in the <em>contrib</em> directory. If you would like to contribute some software, please send email to <url url="mailto:lee@arco.de" name="lee@arco.de">. <sect1>Building the Source Distribution <p> The source distribution can be extracted in any directory. 
The following command will extract the gzipped source archive:
<tscreen><verb>
% gzip -dc harvest-x.y.z.tar.gz | tar xf -
</verb></tscreen>
For archives compressed with bzip2, use:
<tscreen><verb>
% bzip2 -dc harvest-x.y.z.tar.bz2 | tar xf -
</verb></tscreen>
Harvest uses GNU's <em>autoconf</em> package to perform needed configuration at installation time. If you want to override the default installation location of <em>/usr/local/harvest</em>, change the ``prefix'' variable when invoking ``configure''. If desired, you may edit <em>src/common/include/config.h</em> before compiling to change various Harvest compile-time limits and variables. To compile the source tree, type <tt>make</tt>. For example, to build and install the entire Harvest system into the <em>/usr/local/harvest</em> directory, type:
<tscreen><verb>
% ./configure
% make
% make install
</verb></tscreen>
You may see some compiler warning messages, which you can ignore. Building the entire Harvest distribution will take a few minutes on a reasonably fast machine. The compiled source tree takes approximately 25 megabytes of disk space. Later, after the installed software is working, you can remove the compiled code (``.o'' files) and other intermediate files by typing <tt>make clean</tt>. If you want to remove the configure-generated Makefiles, type <tt>make distclean</tt>. <sect1>Additional installation for the Harvest Broker <label id="Additional installation for the Harvest Broker"> <p> <sect2>Checking the installation for HTTP access <p> The Broker interacts with your HTTP server in a number of ways. You should make sure that the HTTP server can properly access the files it needs. In many cases, the HTTP server will run under a different userid than the owner of the Harvest files. First, make sure the HTTP server userid can read the <em>query.html</em> files in each broker directory. Second, make sure the HTTP server userid can access and execute the CGI programs in <em>$HARVEST_HOME/cgi-bin/</em>. 
The <tt>search.cgi</tt> script reads files from the <em>$HARVEST_HOME/cgi-bin/lib/</em> directory, so check that as well. Finally, check the files in <em>$HARVEST_HOME/lib/</em>. Some of the CGI Perl scripts require ``include'' files in this directory. <sect2>Required modifications to your HTTP server <p> The Harvest Broker requires that an HTTP server is running, and that the HTTP server ``knows'' about the Broker's files. Below are some examples of how to configure various HTTP servers to work with the Harvest Broker. <sect2>Apache httpd <p> Apache requires a <bf>ScriptAlias</bf> and an <bf>Alias</bf> entry in <em>httpd.conf</em>, e.g.:
<tscreen><verb>
ScriptAlias /Harvest/cgi-bin/ Your-HARVEST_HOME/cgi-bin/
Alias       /Harvest/         Your-HARVEST_HOME/
</verb></tscreen>
<em>WARNING:</em> The <bf>ScriptAlias</bf> entry must appear <em>before</em> the <bf>Alias</bf> entry. Additionally, it might be necessary to configure Apache httpd to follow <em>symbolic links</em>. To do this, add the following to your <em>httpd.conf</em>:
<tscreen><verb>
<Directory Your-HARVEST_HOME>
Options FollowSymLinks
</Directory>
</verb></tscreen>
<sect2>Other HTTP servers <p> Install the HTTP server and modify its configuration file so that the <em>/Harvest</em> directory points to <em>$HARVEST_HOME</em>. You will also need to configure your HTTP server so that it knows that the directory <em>/Harvest/cgi-bin</em> contains valid CGI programs. If the default behaviour of your HTTP server is not to follow symbolic links, you will need to configure it so that it will follow symbolic links in the <em>/Harvest</em> directory. <sect1>Upgrading versions of the Harvest software <p> <sect2>Upgrading from version 1.6 to version 1.8 <p> You <em>can not</em> install version 1.8 on top of version 1.6. For example, the change from version 1.6 to version 1.8 included some reorganization of the executables, and hence simply installing version 1.8 on top of version 1.6 would cause you to use old executables in some cases. 
To upgrade from Harvest version 1.6 to 1.8, do: <enum> <item>Move your old installation to a temporary location. <item>Install the new version as directed by the release notes. <item>Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation. <descrip> <tag/Gatherers:/ you need to move the Gatherer's directory into <em>$HARVEST_HOME/gatherers</em>. Section <ref id="RootNode specifications" name="RootNode specifications"> describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file. <tag/Brokers:/ rebuild your broker by using <tt>CreateBroker</tt> and merge in any customizations you have made to your old Broker. </descrip> </enum> <sect2>Upgrading from version 1.5 to version 1.6 <p> There are no known incompatibilities between versions 1.5 and 1.6. <sect2>Upgrading from version 1.4 to version 1.5 <p> You <em>can not</em> install version 1.5 on top of version 1.4. For example, the change from version 1.4 to version 1.5 included some reorganization of the executables, and hence simply installing version 1.5 on top of version 1.4 would cause you to use old executables in some cases. To upgrade from Harvest version 1.4 to 1.5, do: <enum> <item>Move your old installation to a temporary location. <item>Install the new version as directed by the release notes. <item>Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation. <descrip> <tag/Gatherers:/ you need to move the Gatherer's directory into <em>$HARVEST_HOME/gatherers</em>. Section <ref id="RootNode specifications" name="RootNode specifications"> describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file. <tag/Brokers:/ you need to move the Broker's directory into <em>$HARVEST_HOME/brokers</em>. 
Remove any <em>.glimpse_*</em> files from your Broker's directory and use the <em>admin.html</em> interface to force a full-index. You may want, however, to rebuild your broker by using <tt>CreateBroker</tt> so that you can use the updated <em>query.html</em> and related files. </descrip> </enum> <sect2>Upgrading from version 1.3 to version 1.4 <p> There are no known incompatibilities between versions 1.3 and 1.4. <sect2>Upgrading from version 1.2 to version 1.3 <p> Version 1.3 is mostly backwards compatible with 1.2, with the following exception: Harvest 1.3 uses Glimpse 3.0. The <em>.glimpse_*</em> files in the broker directory created with Harvest 1.2 (Glimpse 2.0) are incompatible. After installing Harvest 1.3 you should: <enum> <item>Shut down any running brokers. <item>Execute <tt>rm .glimpse_*</tt> in each broker directory. <item>Restart your brokers with <tt>RunBroker</tt>. <item>Force a full-index from the <em>admin.html</em> interface. </enum> <sect2>Upgrading from version 1.1 to version 1.2 <p> There are a few incompatibilities between Harvest version 1.1 and version 1.2. <itemize> <item>The Gatherer has improved incremental gathering support which is incompatible with version 1.1. To update your existing Gatherer, change into the Gatherer's directory and run the following commands:
<tscreen><verb>
% set path = ($HARVEST_HOME/lib/gatherer $path)
% cd data
% rm -f INDEX.gdbm
% mkindex
</verb></tscreen>
This should create the <em>INDEX.gdbm</em> and <em>MD5.gdbm</em> files in the current directory. <item>The Broker has a new log format for the <em>admin/LOG</em> file which is incompatible with version 1.1. </itemize> <sect2>Upgrading to version 1.1 from version 1.0 or older <p> If you already have an older version of Harvest installed, and want to upgrade, you <em>can not</em> unpack the new distribution on top of the old one. 
For example, the change from version 1.0 to version 1.1 included some reorganization of the executables, and hence simply installing version 1.1 on top of version 1.0 would cause you to use old executables in some cases. On the other hand, you may not want to start over from scratch with a new software version, as that would not take advantage of the data you have already gathered and indexed. Instead, to upgrade from Harvest version 1.0 to 1.1, do the following: <enum> <item>Move your old installation to a temporary location. <item>Install the new version as directed by the release notes. <item>Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation. <descrip> <tag/Gatherers:/ you need to move the Gatherer's directory into <em>$HARVEST_HOME/gatherers</em>. Section <ref id="RootNode specifications" name="RootNode specifications"> describes the new Gatherer workload specifications which were introduced in version 1.1; you may modify your Gatherer's configuration file to employ this new functionality. <tag/Brokers:/ you need to move the Broker's directory into <em>$HARVEST_HOME/brokers</em>. You may want, however, to rebuild your broker by using <tt>CreateBroker</tt> so that you can use the updated <em>query.html</em> and related files. </descrip> </enum> <sect1>Starting up the system: RunHarvest and related commands <label id="Starting up the system: RunHarvest and related commands"> <p> The simplest way to start the Harvest system is to use the <tt>RunHarvest</tt> command. <tt>RunHarvest</tt> prompts the user with a short list of questions about what data to index, etc., and then creates and runs a Gatherer and Broker with a ``stock'' (non-customized) set of content extraction and indexing mechanisms. Some more primitive commands are also available, for starting individual Gatherers and Brokers (e.g., if you want to distribute the gathering process). 
The Harvest startup commands are: <descrip> <tag/RunHarvest/ Checks that the Harvest software is installed correctly, prompts the user for basic configuration information, and then creates and runs a Gatherer and a Broker. If you have <em>$HARVEST_HOME</em> set, then it will use it; otherwise, it tries to determine <em>$HARVEST_HOME</em> automatically. Found in the <em>$HARVEST_HOME</em> directory. <tag/RunBroker/ Runs a Broker. Found in the Broker's directory. <tag/RunGatherer/ Runs a Gatherer. Found in the Gatherer's directory. <tag/CreateBroker/ Creates a single Broker which will collect its information from other existing Brokers or Gatherers. Used by <tt>RunHarvest</tt>, or can be run by a user to create a new Broker. Uses <em>$HARVEST_HOME</em>, and defaults to <em>/usr/local/harvest</em>. Found in the <em>$HARVEST_HOME/bin</em> directory. </descrip> There is no <tt>CreateGatherer</tt> command, but the <tt>RunHarvest</tt> command can create a Gatherer, or you can create a Gatherer manually (see Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps"> or Section <ref id="Gatherer Examples" name="Gatherer Examples">). The layout of the installed Harvest directories and programs is discussed in Section <ref id="Programs and layout of the installed Harvest software" name="Programs and layout of the installed Harvest software">. Among other things, the <tt>RunHarvest</tt> command asks the user what port numbers to use when running the Gatherer and the Broker. By default, the Gatherer will use port 8500 and the Broker will use the Gatherer port plus 1. The choice of port numbers depends on your particular machine -- you need to choose ports that are not in use by other servers on your machine. 
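One quick way to check whether a candidate port is free is to look for existing listening sockets. The following is only a sketch: the <tt>netstat</tt> flags shown are common but vary by platform, and ports 8500 and 8501 are simply the defaults suggested above.

```shell
# Report whether anything is already listening on the default
# Gatherer (8500) and Broker (8501) ports.
for port in 8500 8501; do
    if netstat -an 2>/dev/null | grep "[.:]$port " | grep -q LISTEN; then
        echo "port $port: in use"
    else
        echo "port $port: apparently free"
    fi
done
```

If a port is reported as in use, pick another pair of unused ports when answering the <tt>RunHarvest</tt> prompts.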
You might look at your <em>/etc/services</em> file to see what ports are in use (although this file only lists some servers; other servers use ports without registering that information anywhere). Usually the above port numbers will not be in use by other processes. Probably the easiest thing is simply to try using the default port numbers, and see if they work. The remainder of this manual provides information for users who wish to customize or otherwise make more sophisticated use of Harvest than what happens when you install the system and run <tt>RunHarvest</tt>. <sect1>Harvest team contact information <p> If you have questions about the Harvest system or problems with the software, post a note to the USENET newsgroup <url url="news:comp.infosystems.harvest" name="comp.infosystems.harvest">. Please note your machine type, operating system type, and Harvest version number in your correspondence. If you have bug fixes, ports to new platforms or other software improvements, please email them to the Harvest maintainer <url url="mailto:lee@arco.de" name="lee@arco.de">. <sect>The Gatherer <label id="The Gatherer"> <p> <sect1>Overview <p> The Gatherer retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and then summarizes those resources in various type-specific ways to generate structured indexing information. For example, a Gatherer can retrieve a technical report from an FTP archive, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest Brokers or other search services can then retrieve the indexing information from the Gatherer to use in a searchable index available via a WWW interface. The Gatherer consists of a number of separate components. The <tt>Gatherer</tt> program reads a Gatherer configuration file and controls the overall process of enumerating and summarizing data objects. 
The structured indexing information that the Gatherer collects is represented as a list of attribute-value pairs using the <em>Summary Object Interchange Format</em> (SOIF, see Section <ref id="The Summary Object Interchange Format (SOIF)" name="The Summary Object Interchange Format (SOIF)">). The <tt>gatherd</tt> daemon serves the Gatherer database to Brokers. It hangs around, in the background, after a gathering session is complete. A stand-alone <tt>gather</tt> program is a client for the <tt>gatherd</tt> server. It can be used from the command line for testing, and is used by the Broker. The Gatherer uses a local disk cache to store objects it has retrieved. The disk cache is described in Section <ref id="The local disk cache" name="The local disk cache">. Even though the <tt>gatherd</tt> daemon remains in the background, a Gatherer does not automatically update or refresh its summary objects. Each object in a Gatherer has a Time-to-Live value. Objects remain in the database until they expire. See Section <ref id="Periodic gathering and realtime updates" name="Periodic gathering and realtime updates"> for more information on keeping Gatherer objects up to date. Several example Gatherers are provided with the Harvest software distribution (see Section <ref id="Gatherer Examples" name="Gatherer Examples">). <sect1>Basic setup <label id="Basic setup"> <p> To run a basic Gatherer, you need only list the Uniform Resource Locators (URLs, see <htmlurl url="http://www.ietf.org/rfc/rfc1630.txt" name="RFC1630"> and <htmlurl url="http://www.ietf.org/rfc/rfc1738.txt" name="RFC1738">) from which it will gather indexing information. This list is specified in the Gatherer configuration file, along with other optional information such as the Gatherer's name and the directory in which it resides (see Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file"> for details on the optional information). 
Below is an example Gatherer configuration file:
<tscreen><verb>
#
#  sample.cf - Sample Gatherer Configuration File
#
Gatherer-Name: My Sample Harvest Gatherer
Gatherer-Port: 8500
Top-Directory: /usr/local/harvest/gatherers/sample

<RootNodes>
# Enter URLs for RootNodes here
http://www.mozilla.org/
http://www.xfree86.org/
</RootNodes>

<LeafNodes>
# Enter URLs for LeafNodes here
http://www.arco.de/~kj/index.html
</LeafNodes>
</verb></tscreen>
As shown in the example configuration file, you may classify a URL as a <bf>RootNode</bf> or a <bf>LeafNode</bf>. For a LeafNode URL, the Gatherer simply retrieves the URL and processes it. LeafNode URLs are typically files like PostScript papers or compressed ``tar'' distributions. For a RootNode URL, the Gatherer will expand it into zero or more LeafNode URLs by recursively enumerating it in an access method-specific way. For FTP or Gopher, the Gatherer will perform a recursive directory listing on the FTP or Gopher server to expand the RootNode (typically a directory name). For HTTP, a RootNode URL is expanded by following the embedded HTML links to other URLs. For News, the enumeration returns all the messages in the specified USENET newsgroup. PLEASE BE CAREFUL when specifying RootNodes, as it is possible to specify an enormous amount of work with a single RootNode URL. To help prevent a misconfigured Gatherer from abusing servers or running wildly, by default the Gatherer will only expand a RootNode into 250 LeafNodes, and will only include HTML links that point to documents that reside on the same server as the original RootNode URL. There are several options that allow you to change these limits and otherwise enhance the Gatherer specification. See Section <ref id="RootNode specifications" name="RootNode specifications"> for details. The Gatherer is a <htmlurl url="http://www.robotstxt.org/wc/robots.html" name="``robot''"> and collects URLs starting from the URLs specified in RootNodes. 
It obeys the <em>robots.txt</em> convention and the <em>robots META tag</em>. It is also <htmlurl url="http://www.ietf.org/rfc/rfc2616.txt" name="HTTP Version 1.1"> compliant and sends the <em>User-Agent</em> and <em>From</em> request fields to HTTP servers for accountability. After you have written the Gatherer configuration file, create a directory for the Gatherer and copy the configuration file there. Then, run the <tt>Gatherer</tt> program with the configuration file as the only command-line argument, as shown below:
<tscreen><verb>
% Gatherer GathName.cf
</verb></tscreen>
The Gatherer will generate a database of the content summaries, a log file (<em>log.gatherer</em>), and an error log file (<em>log.errors</em>). It will also start the <tt>gatherd</tt> daemon which exports the indexing information automatically to Brokers and other clients. To view the exported indexing information, you can use the <tt>gather</tt> client program, as shown below:
<tscreen><verb>
% gather localhost 8500 | more
</verb></tscreen>
The <tt>gather</tt> program also accepts a few options and arguments: the <bf>-info</bf> option causes the Gatherer to respond only with the Gatherer summary information, which consists of the attributes available in the specified Gatherer's database, the Gatherer's host and name, the range of object update times, and the number of objects. Compression is the default, but can be disabled with the <bf>-nocompress</bf> option. An optional timestamp argument tells the Gatherer to send only the objects that have changed since the specified timestamp (in seconds since the UNIX ``epoch'' of January 1, 1970). <sect2>Gathering News URLs with NNTP <p> News URLs are somewhat different from those of the other access methods, because a News URL generally does not contain a hostname. The Gatherer retrieves News URLs from an NNTP server. The name of this server must be placed in the environment variable <em>$NNTPSERVER</em>. It is probably a good idea to add this to your <tt>RunGatherer</tt> script. 
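For example, you might set the variable near the top of your <tt>RunGatherer</tt> script. This is only a sketch, and <em>news.example.com</em> is a placeholder for your site's actual NNTP server:

```shell
# Tell the Gatherer which NNTP server to use for News URLs.
NNTPSERVER=news.example.com
export NNTPSERVER
```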
If the environment variable is not set, the Gatherer attempts to connect to a host named <em>news</em> at your site. <sect2>Cleaning out a Gatherer <p> Remember that the Gatherer databases persist between runs. Objects remain in the databases until they expire. When experimenting with the Gatherer, it is always a good idea to ``clean out'' the databases between runs. This is most easily accomplished by executing this command from the Gatherer directory:
<tscreen><verb>
% rm -rf data tmp log.*
</verb></tscreen>
<sect1>RootNode specifications <label id="RootNode specifications"> <p> The RootNode specification facility described in Section <ref id="Basic setup" name="Basic setup"> provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits, for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. It is possible to specify these and other aspects of enumeration, using the following syntax:
<tscreen><verb>
<RootNodes>
URL EnumSpec
URL EnumSpec
...
</RootNodes>
</verb></tscreen>
where <em>EnumSpec</em> is on a single line (using ``<bf>\</bf>'' to escape linefeeds), with the following syntax:
<tscreen><verb>
URL=URL-Max[,URL-Filter-filename] \
Host=Host-Max[,Host-Filter-filename] \
Access=TypeList \
Delay=Seconds \
Depth=Number \
Enumeration=Enumeration-Program
</verb></tscreen>
The <em>EnumSpec</em> modifiers are all optional, and have the following meanings: <descrip> <tag/URL-Max/ The number specified on the right hand side of the ``URL='' expression lists the maximum number of LeafNode URLs to generate at all levels of depth from the current URL. 
Note that <em>URL-Max</em> is the maximum number of URLs that are generated during the enumeration, and <em>not</em> a limit on how many URLs can pass through the candidate selection phase (see Section <ref id="Customizing the candidate selection step" name="Customizing the candidate selection step">). <tag/URL-Filter-filename/ This is the name of a file containing a set of regular expression filters (see Section <ref id="RootNode filters" name="RootNode filters">) to allow or deny particular LeafNodes in the enumeration. The default filter is <em>$HARVEST_HOME/lib/gatherer/URL-filter-default</em>, which excludes many image and sound files. <tag/Host-Max/ The number specified on the right hand side of the ``Host='' expression lists the maximum number of hosts that will be touched during the RootNode enumeration. This enumeration actually counts hosts by IP address so that aliased hosts are properly enumerated. Note that this does not work correctly for multi-homed hosts, or for hosts with rotating DNS entries (used by some sites for load balancing heavily accessed servers). <em>Note:</em> Prior to Harvest Version 1.2 the ``Host=...'' line was called ``Site=...''. We changed the name to ``Host='' because it is more intuitively meaningful (being a host count limit, not a site count limit). For backwards compatibility with older Gatherer configuration files, we will continue to treat ``Site='' as an alias for ``Host=''. <tag/Host-Filter-filename/ This is the name of a file containing a set of regular expression filters to allow or deny particular hosts in the enumeration. Each expression can specify both a host name (or IP address) and a port number (in case you have multiple servers running on different ports of the same machine and you want to index only one). The syntax is ``hostname:port''. <tag/Access/ If the RootNode is an HTTP URL, then you can specify which access methods to follow during the enumeration. 
Valid access method types are: <bf>FILE, FTP, Gopher, HTTP, News, Telnet,</bf> or <bf>WAIS</bf>. Use a ``<bf>|</bf>'' character between type names to allow multiple access methods. For example, ``<bf>Access=HTTP|FTP|Gopher</bf>'' will follow HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode URL. <em>Note:</em> We do not support cross-method enumeration from Gopher, because of the difficulty of ensuring that Gopher pointers do not cross site boundaries. For example, the Gopher URL <em>gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/</em> would get an FTP directory listing of ftp.cs.washington.edu:/pub, even though the host part of the URL is powell.cs.colorado.edu. <tag/Delay/ This is the number of seconds to wait between server contacts. It defaults to one second when not specified otherwise. <bf>Delay=3</bf> will let the Gatherer sleep 3 seconds between server contacts. <tag/Depth/ This is the maximum number of levels of enumeration that will be followed during gathering. <bf>Depth=0</bf> means that there is <em>no</em> limit to the depth of the enumeration. <bf>Depth=1</bf> means the specified URL will be retrieved, and all the URLs referenced by the specified URL will be retrieved; and so on for higher Depth values. In other words, the enumeration will follow links up to <em>Depth</em> steps away from the specified URL. <tag/Enumeration-Program/ This modifier provides a very flexible way to control a Gatherer. The Enumeration-Program is a filter which reads URLs as input and writes new enumeration parameters on output. See Section <ref id="Generic Enumeration program description" name="Generic Enumeration program description"> for specific details. </descrip> By default, <em>URL-Max</em> is 250, <em>URL-Filter</em> imposes no limit, <em>Host-Max</em> is 1, <em>Host-Filter</em> imposes no limit, <em>Access</em> is HTTP only, <em>Delay</em> is 1 second, and <em>Depth</em> is zero.
There is no way to specify an unlimited value for <em>URL-Max</em> or <em>Host-Max</em>. <sect2>RootNode filters <label id="RootNode filters"> <p> Filter files use the standard UNIX regular expression syntax (as defined by the POSIX standard), not the csh ``globbing'' syntax. For example, you would use ``.*abc'' to indicate any string ending with ``abc'', not ``*abc''. A filter file has the following syntax: <tscreen><verb> Deny regex Allow regex </verb></tscreen> The <em>URL-Filter</em> regular expressions are matched only on the URL-path portion of each URL (the scheme, hostname and port are excluded). For example, the following URL-Filter file would allow all URLs except those containing the regular expression ``<em>/gatherers/</em>'': <tscreen><verb> Deny /gatherers/ Allow . </verb></tscreen> Another common use of URL-filters is to prevent the Gatherer from travelling ``up'' a directory. Automatically generated HTML pages for HTTP and FTP directories often contain a link for the parent directory ``<em>..</em>''. To keep the Gatherer below a specific directory, use a URL-filter file such as: <tscreen><verb> Allow ^/my/cool/stuff/ Deny . </verb></tscreen> The <em>Host-Filter</em> regular expressions are matched on the ``hostname:port'' portion of each URL. Because the port is included, you cannot use ``<bf>$</bf>'' to anchor the end of a hostname. Beginning with version 1.3, IP addresses may be specified in place of hostnames. A class B address such as 128.138.0.0 would be written as ``<bf>^128\.138\..*</bf>'' in regular expression syntax. For example: <tscreen><verb> Deny bcn.boulder.co.us:8080 Deny bvsd.k12.co.us Allow ^128\.138\..* Deny . </verb></tscreen> The order of the <bf>Allow</bf> and <bf>Deny</bf> entries is important, since the filters are applied sequentially from first to last. So, for example, if you list ``<bf>Allow .*</bf>'' first, no subsequent <bf>Deny</bf> expressions will be used, since this <bf>Allow</bf> filter will allow all entries.
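As one more illustration of filter ordering, the following hypothetical <em>URL-Filter</em> file (the suffixes are only examples) admits HTML and PostScript files and rejects everything else, relying on the final ``Deny'' as a catch-all: <tscreen><verb> Allow \.html$ Allow \.ps$ Deny . </verb></tscreen> Because filters are applied from first to last, moving the ``Deny .'' line to the top would reject every URL before the Allow lines were ever consulted.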
<sect2>Generic Enumeration program description <label id="Generic Enumeration program description"> <p> Flexible enumeration can be achieved by giving an <bf>Enumeration=Enumeration-Program</bf> modifier to a RootNode URL. The <em>Enumeration-Program</em> is a filter which takes URLs on standard input and writes new RootNode URLs on standard output. The output format differs from the format used to specify a RootNode URL in a Gatherer configuration file. Each output line must have nine fields separated by spaces. These fields are: <tscreen><verb> URL URL-Max URL-Filter-filename Host-Max Host-Filter-filename Access Delay Depth Enumeration-Program </verb></tscreen> These are the same fields as described in Section <ref id="RootNode specifications" name="RootNode specifications">. Values must be given for each field. Use <em>/dev/null</em> to disable the URL-Filter-filename and Host-Filter-filename. Use <tt>/bin/false</tt> to disable the Enumeration-Program. <sect2>Example RootNode configuration <p> Below is an example RootNode configuration: <tscreen><verb> <RootNodes> (1) http://harvest.cs.colorado.edu/ URL=100,MyFilter (2) http://www.cs.colorado.edu/ Host=50 Delay=60 (3) gopher://gopher.colorado.edu/ Depth=1 (4) file://powell.cs.colorado.edu/home/hardy/ Depth=2 (5) ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1 (6) http://harvest.cs.colorado.edu/~hardy/hotlist.html \ Depth=1 Delay=60 (7) http://harvest.cs.colorado.edu/~hardy/ \ Depth=2 Access=HTTP|FTP </RootNodes> </verb></tscreen> Each of the above RootNodes follows a different enumeration configuration as follows: <enum> <item>This RootNode will gather up to 100 documents that pass through the URL name filters contained within the file <em>MyFilter</em>. <item>This RootNode will gather the documents from up to the first 50 hosts it encounters while enumerating the specified URL, with no limit on the Depth of link enumeration. It will also wait for 60 seconds between each retrieval.
<item>This RootNode will gather only the documents from the top-level menu of the Gopher server at <em>gopher.colorado.edu</em>. <item>This RootNode will gather all documents that are in the <em>/home/hardy</em> directory, or that are in any subdirectory of <em>/home/hardy</em>. <item>This RootNode will gather only the documents that are in the <em>/pub/cs/techreports</em> directory which, in this case, contains some bibliographic files rather than the technical reports themselves. <item>This RootNode will gather all documents that are within 1 step of the specified RootNode URL, waiting 60 seconds between each retrieval. This is a good method for indexing your hotlist. By using an HTML file containing ``hotlist'' pointers as this RootNode, this enumeration will gather the top-level pages of all of your hotlist pointers. <item>This RootNode will gather all documents that are at most 2 steps away from the specified RootNode URL. Furthermore, it will follow and enumerate any HTTP or FTP URLs that it encounters during enumeration. </enum> <sect2>Gatherer enumeration vs. candidate selection <label id="Gatherer enumeration vs. candidate selection"> <p> In addition to using the <em>URL-Filter</em> and <em>Host-Filter</em> files for the RootNode specification mechanism described in Section <ref id="RootNode specifications" name="RootNode specifications">, you can prevent documents from being indexed by customizing the <em>stoplist.cf</em> file, described in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">. Since these mechanisms are invoked at different times, they have different effects. The <em>URL-Filter</em> and <em>Host-Filter</em> mechanisms are invoked by the Gatherer's ``RootNode'' enumeration programs.
Using these filters as stop lists can prevent unwanted objects from being retrieved across the network. This can dramatically reduce gathering time and network traffic. The <em>stoplist.cf</em> file is used by the <em>Essence</em> content extraction system (described in Section <ref id="Extracting data for indexing: The Essence summarizing subsystem" name="Extracting data for indexing: The Essence summarizing subsystem">) <em>after</em> the objects are retrieved, to select which objects should be content extracted and indexed. This can be useful because Essence provides a more powerful means of rejecting indexing candidates, in which you can customize based not only on file naming conventions but also on file contents (e.g., looking at strings at the beginning of a file or at UNIX ``magic'' numbers), and also by more sophisticated file-grouping schemes (e.g., deciding not to extract contents from object code files for which source code is available). As an example of combining these mechanisms, suppose you want to index the ``.ps'' files linked into your WWW site. You could do this by having a <em>stoplist.cf</em> file that contains ``HTML'', and a RootNode <em>URL-Filter</em> that contains: <tscreen><verb> Allow \.html Allow \.ps Deny .* </verb></tscreen> As a final note, independent of these customizations the Gatherer attempts to avoid retrieving objects where possible, by using a local disk cache of objects, and by using the HTTP ``If-Modified-Since'' request header. The local disk cache is described in Section <ref id="The local disk cache" name="The local disk cache">. <sect1>Generating LeafNode/RootNode URLs from a program <p> It is possible to generate RootNode or LeafNode URLs automatically from program output. This might be useful when gathering a large number of Usenet newsgroups, for example. The program is specified inside the RootNode or LeafNode section, preceded by a pipe symbol.
<tscreen><verb> <LeafNodes> |generate-news-urls.sh </LeafNodes> </verb></tscreen> The script must output valid URLs, such as <tscreen><verb> news:comp.unix.voodoo news:rec.pets.birds http://www.nlanr.net/ ... </verb></tscreen> In the case of RootNode URLs, enumeration parameters can be given after the program. <tscreen><verb> <RootNodes> |my-fave-sites.pl Depth=1 URL=5000,url-filter </RootNodes> </verb></tscreen> <sect1>Extracting data for indexing: The Essence summarizing subsystem <label id="Extracting data for indexing: The Essence summarizing subsystem"> <p> After the Gatherer retrieves a document, it passes the document through a subsystem called <em>Essence</em> to extract indexing information. Essence allows the Gatherer to collect indexing information easily from a wide variety of information, using different techniques depending on the type of data and the needs of the particular corpus being indexed. In a nutshell, Essence can determine the type of data pointed to by a URL (e.g., PostScript vs. HTML), ``unravel'' presentation nesting formats (such as compressed ``tar'' files), select which types of data to index (e.g., don't index Audio files), and then apply a type-specific extraction algorithm (called a <em>summarizer</em>) to the data to generate a content summary. Users can customize each of these aspects, but often this is not necessary. Harvest is distributed with a ``stock'' set of type recognizers, presentation unnesters, candidate selectors, and summarizers that work well for many applications. Below we describe the stock summarizer set, the current components distribution, and how users can customize summarizers to change how they operate and add summarizers for new types of data. If you develop a summarizer that is likely to be useful to other users, please notify us via email at <url url="mailto:lee@arco.de" name="lee@arco.de"> so we may include it in our Harvest distribution. 
<tscreen><verb> Type Summarizer Function -------------------------------------------------------------------- Bibliographic Extract author and titles Binary Extract meaningful strings and manual page summary C, CHeader Extract procedure names, included file names, and comments Dvi Invoke the Text summarizer on extracted ASCII text FAQ, FullText, README Extract all words in file Font Extract comments HTML Extract anchors, hypertext links, and selected fields LaTex Parse selected LaTex fields (author, title, etc.) Mail Extract certain header fields Makefile Extract comments and target names ManPage Extract synopsis, author, title, etc., based on ``-man'' macros News Extract certain header fields Object Extract symbol table Patch Extract patched file names Perl Extract procedure names and comments PostScript Extract text in word processor-specific fashion, and pass through Text summarizer. RCS, SCCS Extract revision control summary RTF Up-convert to HTML and pass through HTML summarizer SGML Extract fields named in extraction table ShellScript Extract comments SourceDistribution Extract full text of README file and comments from Makefile and source code files, and summarize any manual pages SymbolicLink Extract file name, owner, and date created TeX Invoke the Text summarizer on extracted ASCII text Text Extract first 100 lines plus first sentence of each remaining paragraph Troff Extract author, title, etc., based on ``-man'', ``-ms'', ``-me'' macro packages, or extract section headers and topic sentences. Unrecognized Extract file name, owner, and date created. </verb></tscreen> <sect2>Default actions of ``stock'' summarizers <p> The table in Section <ref id="Extracting data for indexing: The Essence summarizing subsystem" name="Extracting data for indexing: The Essence summarizing subsystem"> provides a brief reference for how documents are summarized depending on their type. 
These actions can be customized, as discussed in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">. Some summarizers are implemented as UNIX programs while others are expressed as regular expressions; see Section <ref id="Customizing the summarizing step" name="Customizing the summarizing step"> or Section <ref id="Example 4" name="Example 4"> for more information about how to write a summarizer. <sect2>Summarizing SGML data <label id="Summarizing SGML data"> <p> It is possible to summarize documents that conform to the Standard Generalized Markup Language (SGML), for which you have a Document Type Definition (DTD). The World Wide Web's Hypertext Mark-up Language (HTML) is actually a particular application of SGML, with a corresponding DTD. (In fact, the Harvest HTML summarizer can use the HTML DTD and our SGML summarizing mechanism, which provides various advantages; see Section <ref id="The SGML-based HTML summarizer" name="The SGML-based HTML summarizer">.) SGML is being used in an increasingly broad variety of applications, for example as a format for storing data for a number of physical sciences. Because SGML allows documents to contain a good deal of structure, Harvest can summarize SGML documents very effectively. The SGML summarizer (<tt>SGML.sum</tt>) uses the <tt>sgmls</tt> program by James Clark to parse the SGML document. The parser needs both a DTD for the document and a Declaration file that describes the allowed character set. The <tt>SGML.sum</tt> program uses a table that maps SGML tags to SOIF attributes. <sect3>Location of support files <p> SGML support files can be found in <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/</em>. 
For example, these are the default pathnames for HTML summarizing using the SGML summarizing mechanism: <tscreen><verb> $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl </verb></tscreen> The location of the DTD file must be specified in the <tt>sgmls</tt> catalog (<em>$HARVEST_HOME/lib/gatherer/sgmls-lib/catalog</em>). For example: <tscreen><verb> DOCTYPE HTML HTML/html.dtd </verb></tscreen> The <tt>SGML.sum</tt> program looks for the <em>.decl</em> file in the default location. An alternate pathname can be specified with the <bf>-d</bf> option to <tt>SGML.sum</tt>. The summarizer looks for the <em>.sum.tbl</em> file first in the Gatherer's lib directory and then in the default location. Both of these can be overridden with the <bf>-t</bf> option to <tt>SGML.sum</tt>. <sect3>The SGML to SOIF table <p> The translation table provides a simple yet powerful way to specify how an SGML document is to be summarized. There are four ways to map SGML data into SOIF. The first two are concerned with placing the <em>content</em> of an SGML tag into a SOIF attribute. A simple SGML-to-SOIF mapping looks like this: <tscreen><verb> <TAG> soif1,soif2,... </verb></tscreen> This places the content that occurs inside the tag ``TAG'' into the SOIF attributes ``soif1'' and ``soif2''. It is possible to select different SOIF attributes based on SGML attribute values. For example, if ``ATT'' is an attribute of ``TAG'', then it would be written like this: <tscreen><verb> <TAG,ATT=x> x-stuff <TAG,ATT=y> y-stuff <TAG> stuff </verb></tscreen> The second two mappings place values of SGML attributes into SOIF attributes. 
To place the value of the ``ATT'' attribute of the ``TAG'' tag into the ``att-stuff'' SOIF attribute you would write: <tscreen><verb> <TAG:ATT> att-stuff </verb></tscreen> It is also possible to place the value of an SGML attribute into a SOIF attribute named by a different SGML attribute: <tscreen><verb> <TAG:ATT1> $ATT2 </verb></tscreen> When the summarizer encounters an SGML tag not listed in the table, the content is passed to the parent tag and becomes a part of the parent's content. To force the content of some tag <em>not</em> to be passed up, specify the SOIF attribute as ``ignore''. To force the content of some tag to be passed to the parent in addition to being placed into a SOIF attribute, list an additional SOIF attribute named ``parent''. Please see Section <ref id="The SGML-based HTML summarizer" name="The SGML-based HTML summarizer"> for examples of these mappings. <sect3>Errors and warnings from the SGML Parser <p> The <tt>sgmls</tt> parser can generate an overwhelming volume of error and warning messages. This will be especially true for HTML documents found on the Internet, which often do not conform to the strict HTML DTD. By default, errors and warnings are redirected to <em>/dev/null</em> so that they do not clutter the Gatherer's log files. To enable logging of these messages, edit the <tt>SGML.sum</tt> Perl script and set <bf>$syntax_check = 1</bf>. <sect3>Creating a summarizer for a new SGML-tagged data type <p> To create an SGML summarizer for a new SGML-tagged data type with an associated DTD, you need to do the following: <enum> <item>Write a shell script named FOO.sum which simply contains <tscreen><verb> #!/bin/sh exec SGML.sum FOO $* </verb></tscreen> <item>Modify the Essence configuration files (as described in Section <ref id="Customizing the type recognition step" name="Customizing the type recognition step">) so that your documents get typed as FOO.
<item>Create the directory <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/</em> and copy your DTD and Declaration there as FOO.dtd and FOO.decl. Edit <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/catalog</em> and add FOO.dtd to it. <item>Create the translation table FOO.sum.tbl and place it with the DTD in <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/</em>. </enum> At this point you can test everything from the command line as follows: <tscreen><verb> % FOO.sum myfile.foo </verb></tscreen> <sect3>The SGML-based HTML summarizer <label id="The SGML-based HTML summarizer"> <p> Harvest can summarize HTML using the generic SGML summarizer described in Section <ref id="Summarizing SGML data" name="Summarizing SGML data">. The advantage of this approach is that the summarizer is more easily customizable, and fits with the well-conceived SGML model (where you define DTDs for individual document types and build interpretation software to understand DTDs rather than individual document types). The downside is that the summarizer is now pickier about syntax, and many Web documents are not syntactically correct. Because of this pickiness, the default is for the HTML summarizer to run with syntax checking outputs disabled. If your documents are so badly formed that they confuse the parser, this may mean the summarizing process dies unceremoniously. If you find that some of your HTML documents do not get summarized or only get summarized in part, you can turn syntax-checking output on by setting <bf>$syntax_check = 1</bf> in <tt>$HARVEST_HOME/lib/gatherer/SGML.sum</tt>. That will allow you to see which documents are invalid and where. Note that part of the reason for this problem is that Web browsers do not insist on well-formed documents. So, users can easily create documents that are not completely valid, yet display fine. 
Below is the default SGML-to-SOIF table used by the HTML summarizer: <tscreen><verb> HTML ELEMENT SOIF ATTRIBUTES ------------ ----------------------- <A> keywords,parent <A:HREF> url-references <ADDRESS> address <B> keywords,parent <BODY> body <CITE> references <CODE> ignore <EM> keywords,parent <H1> headings <H2> headings <H3> headings <H4> headings <H5> headings <H6> headings <HEAD> head <I> keywords,parent <IMG:SRC> images <META:CONTENT> $NAME <STRONG> keywords,parent <TITLE> title <TT> keywords,parent <UL> keywords,parent </verb></tscreen> The pathname to this file is <em>$HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl</em>. Individual Gatherers may do customized HTML summarizing by placing a modified version of this file in the Gatherer <em>lib</em> directory. Another way to customize is to modify the <tt>HTML.sum</tt> script and add a <bf>-t</bf> option to the SGML.sum command. For example: <tscreen><verb> SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML $* </verb></tscreen> In HTML, the document title is written as: <tscreen><verb> <TITLE>My Home Page</TITLE> </verb></tscreen> The above translation table will place this in the SOIF summary as: <tscreen><verb> title{13}: My Home Page </verb></tscreen> Note that ``keywords,parent'' occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words will be copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable by not removing certain words. Any text that appears inside a pair of CODE tags will not show up in the summary because we specified ``ignore'' as the SOIF attribute. 
URLs in HTML anchors are written as: <tscreen><verb> <A HREF="http://harvest.cs.colorado.edu/"> </verb></tscreen> The specification for <bf><A:HREF></bf> in the above translation table causes this to appear as: <tscreen><verb> url-references{32}: http://harvest.cs.colorado.edu/ </verb></tscreen> <sect3>Adding META data to your HTML <p> One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical usage of the META element is: <tscreen><verb> <META NAME="author" CONTENT="Joe T. Slacker"> </verb></tscreen> By specifying ``<bf><META:CONTENT></bf> $NAME'' in the translation table, this comes out as: <tscreen><verb> author{15}: Joe T. Slacker </verb></tscreen> Using the META tags, HTML authors can easily add a list of keywords to their documents: <tscreen><verb> <META NAME="keywords" CONTENT="word1 word2"> <META NAME="keywords" CONTENT="word3 word4"> </verb></tscreen> <sect3>Other examples <p> A very terse HTML summarizer could be specified with a table that only puts emphasized words into the keywords attribute: <tscreen><verb> HTML ELEMENT SOIF ATTRIBUTES ------------ ----------------------- <A> keywords <B> keywords <EM> keywords <H1> keywords <H2> keywords <H3> keywords <I> keywords <META:CONTENT> $NAME <STRONG> keywords <TITLE> title,keywords <TT> keywords </verb></tscreen> Conversely, a full-text summarizer can be easily specified with only: <tscreen><verb> HTML ELEMENT SOIF ATTRIBUTES ------------ ----------------------- <HTML> full-text <TITLE> title,parent </verb></tscreen> <sect2>Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps <label id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps"> <p> The Harvest Gatherer's actions are defined by a set of configuration and utility files, and a corresponding set of executable programs referenced by some of the configuration files.
If you want to customize a Gatherer, you should create <em>bin</em> and <em>lib</em> subdirectories in the directory where you are running the Gatherer, and then copy <em>$HARVEST_HOME/lib/gatherer/*.cf</em> and <em>$HARVEST_HOME/lib/gatherer/magic</em> into your <em>lib</em> directory. Then add to your Gatherer configuration file: <tscreen><verb> Lib-Directory: lib </verb></tscreen> The details about what each of these files does are described below. The basic contents of a typical Gatherer's directory are as follows (note: some of the file names below can be changed by setting variables in the Gatherer configuration file, as described in Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file">): <tscreen><verb> RunGatherd* bin/ GathName.cf log.errors tmp/ RunGatherer* data/ lib/ log.gatherer bin: MyNewType.sum* data: All-Templates.gz INFO.soif PRODUCTION.gdbm gatherd.log INDEX.gdbm MD5.gdbm gatherd.cf lib: bycontent.cf byurl.cf quick-sum.cf byname.cf magic stoplist.cf tmp: </verb></tscreen> The <tt>RunGatherd</tt> and <tt>RunGatherer</tt> scripts are used to export the Gatherer's database after a machine reboot and to run the Gatherer, respectively. The <em>log.errors</em> and <em>log.gatherer</em> files contain error messages and the output of the <em>Essence</em> typing step, respectively (Essence will be described shortly). The <em>GathName.cf</em> file is the Gatherer's configuration file. The <em>bin</em> directory contains any summarizers and any other programs needed by the summarizers. If you were to customize the Gatherer by adding a summarizer, you would place those programs in this <em>bin</em> directory; <tt>MyNewType.sum</tt> is an example. The <em>data</em> directory contains the Gatherer's database which <tt>gatherd</tt> exports. The Gatherer's database consists of the <em>All-Templates.gz, INDEX.gdbm, INFO.soif, MD5.gdbm</em> and <em>PRODUCTION.gdbm</em> files.
The <em>gatherd.cf</em> file is used to support access control as described in Section <ref id="Controlling access to the Gatherer's database" name="Controlling access to the Gatherer's database">. The <em>gatherd.log</em> file is where the <tt>gatherd</tt> program logs its information. The <em>lib</em> directory contains the configuration files used by the Gatherer's subsystems, namely Essence. These files are described briefly in the following table: <tscreen><verb> bycontent.cf Content parsing heuristics for type recognition step byname.cf File naming heuristics for type recognition step byurl.cf URL naming heuristics for type recognition step magic UNIX ``file'' command specifications (matched against bycontent.cf strings) quick-sum.cf Extracts attributes for summarizing step. stoplist.cf File types to reject during candidate selection </verb></tscreen> <sect3>Customizing the type recognition step <label id="Customizing the type recognition step"> <p> Essence recognizes types in three ways (in order of precedence): by URL naming heuristics, by file naming heuristics, and by locating <em>identifying</em> data within a file using the UNIX <tt>file</tt> command. To modify the type recognition step, edit <em>lib/byname.cf</em> to add file naming heuristics, or <em>lib/byurl.cf</em> to add URL naming heuristics, or <em>lib/bycontent.cf</em> to add by-content heuristics. The by-content heuristics match the output of the UNIX <tt>file</tt> command, so you may also need to edit the <em>lib/magic</em> file. See Section <ref id="Example 3" name="Example 3"> and <ref id="Example 4" name="Example 4"> for detailed examples on how to customize the type recognition step. <sect3>Customizing the candidate selection step <label id="Customizing the candidate selection step"> <p> The <em>lib/stoplist.cf</em> configuration file contains a list of types that are rejected by Essence. You can add or delete types from <em>lib/stoplist.cf</em> to control the candidate selection step. 
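As a concrete illustration, a <em>lib/stoplist.cf</em> that rejects audio, binary, and object-code data would simply list those type names, one per line (these names are examples; the names you list must match the types assigned by your type recognition configuration): <tscreen><verb> Audio Binary Object </verb></tscreen>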
To direct Essence to index only certain types, you can list the types to index in <em>lib/allowlist.cf</em>. Then, supply Essence with the <bf>--allowlist</bf> flag. The file and URL naming heuristics used by the type recognition step (described in Section <ref id="Customizing the type recognition step" name="Customizing the type recognition step">) are particularly useful for candidate selection when gathering remote data. They allow the Gatherer to avoid retrieving files that you don't want to index (in contrast, recognizing types by locating identifying data within a file requires that the file be retrieved first). This approach can save quite a bit of network traffic, particularly when used in combination with enumerated <em>RootNode</em> URLs. For example, many sites provide each of their files in both a compressed and uncompressed form. By building a <em>lib/allowlist.cf</em> containing only the Compressed types, you can avoid retrieving the uncompressed versions of the files. <sect3>Customizing the presentation unnesting step <label id="Customizing the presentation unnesting step"> <p> Some types are declared as ``nested'' types. Essence treats these differently than other types, by running a presentation unnesting algorithm or ``Exploder'' on the data rather than a Summarizer. At present Essence can handle files nested in the following formats: <enum> <item>binhex <item>uuencode <item>shell archive (``shar'') <item>tape archive (``tar'') <item>bzip2 compressed (``bzip2'') <item>compressed <item>GNU compressed (``gzip'') <item>zip compressed archive </enum> To customize the presentation unnesting step you can modify the Essence source file <em>src/gatherer/essence/unnest.c</em>. This file lists the available presentation encodings, and also specifies the unnesting algorithm. Typically, an external program is used to unravel a file into one or more component files (e.g. <tt>bzip2, gunzip, uudecode,</tt> and <tt>tar</tt>). 
An <em>Exploder</em> may also be used to explode a file into a stream of SOIF objects. An Exploder program takes a URL as its first command-line argument and a file containing the data to use as its second, and then generates one or more SOIF objects as output. For your convenience, the <em>Exploder</em> type is already defined as a nested type. To save some time, you can use this type and its corresponding <tt>Exploder.unnest</tt> program rather than modifying the Essence code. See Section <ref id="Example 2" name="Example 2"> for a detailed example on writing an Exploder. The <em>unnest.c</em> file also contains further information on defining the unnesting algorithms. <sect3>Customizing the summarizing step <label id="Customizing the summarizing step"> <p> Essence supports two mechanisms for defining the type-specific extraction algorithms (called <em>Summarizers</em>) that generate content summaries: a UNIX program that takes as its only command line argument the filename of the data to summarize, and line-based regular expressions specified in <em>lib/quick-sum.cf</em>. See Section <ref id="Example 4" name="Example 4"> for detailed examples on how to define both types of Summarizers. The UNIX Summarizers are named using the convention <tt>TypeName.sum</tt> (e.g., <tt>PostScript.sum</tt>). These Summarizers output their content summary in a SOIF attribute-value list (see Section <ref id="The Summary Object Interchange Format (SOIF)" name="The Summary Object Interchange Format (SOIF)">). You can use the <tt>wrapit</tt> command to wrap raw output into the SOIF format (i.e., to provide byte-count delimiters on the individual attribute-value pairs). 
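As a sketch of the first mechanism, here is a minimal hypothetical UNIX Summarizer (the type name ``MyType'' and the ``partial-text'' attribute are invented for this example). Rather than calling <tt>wrapit</tt>, it emits a single SOIF attribute-value pair directly, computing the byte-count delimiter itself:

```shell
#!/bin/sh
# Hypothetical MyType.sum: summarize a file by taking its first
# 10 lines and emitting them as one SOIF attribute-value pair.
# The attribute name "partial-text" is an example, not a stock name.
file=${1:-$0}                        # file to summarize
data=`head -10 "$file"`              # crude content summary
len=`printf '%s' "$data" | wc -c | tr -d ' '`   # byte count for the SOIF delimiter
printf 'partial-text{%s}:\t%s\n' "$len" "$data"
```

Installed as <em>bin/MyType.sum</em> in the Gatherer directory, such a script would be invoked automatically once the type recognition files map documents to the ``MyType'' type.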
There is a summarizer called <tt>FullText.sum</tt> that you can use to perform full text indexing of selected file types, by simply setting up the <em>lib/bycontent.cf</em> and <em>lib/byname.cf</em> configuration files to recognize the desired file types as FullText (i.e., using ``FullText'' in column 1 next to the matching regular expression). <sect1>Post-Summarizing: Rule-based tuning of object summaries <p> It is possible to ``fine-tune'' the summary information generated by the Essence summarizers. A typical application would be to change the <em>Time-to-Live</em> attribute based on some knowledge about the objects. For example, an administrator could use the post-summarizing feature to give quickly-changing objects a lower TTL, and very stable documents a higher TTL. Objects are selected for post-summarizing if they meet a specified condition. A condition consists of three parts: an attribute name, an operation, and some string data. For example: <tscreen><verb> city == 'New York' </verb></tscreen> In this case we are checking whether the <em>city</em> attribute is equal to the string `New York'. For exact string matching, the string data must be enclosed in single quotes. Regular expressions are also supported: <tscreen><verb> city ~ /New York/ </verb></tscreen> Negative operators are also supported: <tscreen><verb>
city != 'New York'
city !~ /New York/
</verb></tscreen> Conditions can be joined with `<bf>&&</bf>' (logical and) or `<bf>||</bf>' (logical or) operators: <tscreen><verb> city == 'New York' && state != 'NY' </verb></tscreen> When all conditions are met for an object, some number of instructions are executed on it. There are four types of instructions which can be specified: <enum> <item>Set an attribute exactly to some specific string. Example: <tscreen><verb> time-to-live = "86400" </verb></tscreen> <item>Filter an attribute through some program. The attribute value is given as input to the filter.
The output of the filter becomes the new attribute value. Example: <tscreen><verb> keywords | tr A-Z a-z </verb></tscreen> <item>Filter multiple attributes through some program. In this case the filter must read and write attributes in the SOIF format. Example: <tscreen><verb> address,city,state,zip ! cleanup-address.pl </verb></tscreen> <item>A special case instruction is to delete an object. To do this, simply write: <tscreen><verb> delete() </verb></tscreen> </enum> <sect2>The Rules file <p> The conditions and instructions are combined together in a ``rules'' file. The format of this file is somewhat similar to a Makefile; conditions begin in the first column and instructions are indented by a tab-stop. Example: <tscreen><verb>
type == 'HTML'
	partial-text | cleanup-html-text.pl

URL ~ /users/
	time-to-live = "86400"
	partial-text ! extract-owner.sh

type == 'SOIFStream'
	delete()
</verb></tscreen> This rules file is specified in the gatherer.cf file with the Post-Summarizing tag, e.g.: <tscreen><verb> Post-Summarizing: lib/myrules </verb></tscreen> <sect2>Rewriting URLs <p> Until version 1.4 it was not possible to rewrite the URL-part of an object summary. It is now possible, but only by using the ``pipe'' instruction. This may be useful for people wanting to run a Gatherer on <em>file://</em> URLs, but have them appear as <em>http://</em> URLs.
This can be done with a post-summarizing rule such as: <tscreen><verb>
url ~ 'file://localhost/web/htdocs/'
	url | fix-url.pl
</verb></tscreen> And the 'fix-url.pl' script might look like: <tscreen><verb>
#!/usr/local/bin/perl -p
s'file://localhost/web/htdocs/'http://www.my.domain/';
</verb></tscreen> <sect1>Gatherer administration <p> <sect2>Setting variables in the Gatherer configuration file <label id="Setting variables in the Gatherer configuration file"> <p> In addition to customizing the steps described in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into <bf>RootNodes</bf> and <bf>LeafNodes</bf>) from which to collect indexing information. Section <ref id="Basic setup" name="Basic setup"> shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file. Each variable name starts in the first column, ends with a colon, and is followed by its value. The following table shows the supported variables: <tscreen><verb>
Access-Delay:       Default delay between URL accesses.
Data-Directory:     Directory where GDBM database is written.
Debug-Options:      Debugging options passed to child programs.
Errorlog-File:      File for logging errors.
Essence-Options:    Any extra options to pass to Essence.
FTP-Auth:           Username/password for protected FTP documents.
Gatherd-Inetd:      Denotes that gatherd is run from inetd.
Gatherer-Host:      Full hostname where the Gatherer is run.
Gatherer-Name:      A unique name for the Gatherer.
Gatherer-Options:   Extra options for the Gatherer.
Gatherer-Port:      Port number for gatherd.
Gatherer-Version:   Version string for the Gatherer.
HTTP-Basic-Auth:    Username/password for protected HTTP documents.
HTTP-Proxy:         host:port of your HTTP proxy.
Keep-Cache:         ``yes'' to not remove local disk cache.
Lib-Directory:      Directory where configuration files live.
Local-Mapping:      Mapping information for local gathering.
Log-File:           File for logging progress.
Post-Summarizing:   A rules-file for post-summarizing.
Refresh-Rate:       Object refresh-rate in seconds, default 1 week.
Time-To-Live:       Object time-to-live in seconds, default 1 month.
Top-Directory:      Top-level directory for the Gatherer.
Working-Directory:  Directory for tmp files and local disk cache.
</verb></tscreen> Notes: <itemize> <item>We recommend that you use the <bf>Top-Directory</bf> variable, since it will set the <bf>Data-Directory</bf>, <bf>Lib-Directory</bf>, and <bf>Working-Directory</bf> variables. <item>Both <bf>Working-Directory</bf> and <bf>Data-Directory</bf> will have files in them after the Gatherer has run. The <bf>Working-Directory</bf> will hold the local-disk cache that the Gatherer uses to reduce network I/O, and the <bf>Data-Directory</bf> will hold the GDBM databases that contain the content summaries. <item>You should use full rather than relative pathnames. <item>All variable definitions <em>must</em> come before the RootNode or LeafNode URLs. <item>Any line that starts with a ``#'' is a comment. <item><bf>Local-Mapping</bf> is discussed in Section <ref id="Local file system gathering for reduced CPU load" name="Local file system gathering for reduced CPU load">. <item><bf>HTTP-Proxy</bf> will retrieve HTTP URLs via a proxy host. The syntax is <bf>hostname:port</bf>; for example, <bf>proxy.yoursite.com:3128</bf>. <item><bf>Essence-Options</bf> is particularly useful, as it lets you customize basic aspects of the Gatherer easily.
<item>The only valid <bf>Gatherer-Options</bf> value is <bf>--save-space</bf>, which directs the Gatherer to be more space efficient when preparing its database for export. <item>The <tt>Gatherer</tt> program accepts the <bf>-background</bf> flag, which causes the Gatherer to run in the background. </itemize> The Essence options are: <tscreen><verb>
Option                 Meaning
--------------------------------------------------------------------
--allowlist filename   File with list of types to allow
--fake-md5s            Generates MD5s for SOIF objects from a
                       .unnest program
--fast-summarizing     Trade speed for some consistency.  Use only
                       when an external summarizer is known to
                       generate clean, unique attributes.
--full-text            Use entire file instead of summarizing.
                       Alternatively, you can perform full text
                       indexing of individual file types by using
                       the FullText.sum summarizer.
--max-deletions n      Number of GDBM deletions before reorganization
--minimal-bookkeeping  Generates a minimal amount of bookkeeping attrs
--no-access            Do not read contents of objects
--no-keywords          Do not automatically generate keywords
--stoplist filename    File with list of types to remove
--type-only            Only type data; do not summarize objects
</verb></tscreen> A particular note about full text summarizing: using the Essence <bf>--full-text</bf> option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results (e.g., it will directly include the PostScript for a document rather than first passing the data through a PostScript to text extractor, yielding few searchable terms and large SOIF objects). Using the individual file type summarizing mechanism described in Section <ref id="Customizing the summarizing step" name="Customizing the summarizing step"> will work better in this regard, but will require you to specify how data are extracted for each individual file type.
In a future version of Harvest we will change the Essence <bf>--full-text</bf> option to perform content extraction before including the full text of documents. <sect2>Local file system gathering for reduced CPU load <label id="Local file system gathering for reduced CPU load"> <p> Although the Gatherer's work load is specified using URLs, often the files being gathered are located on a local file system. In this case it is much more efficient to gather directly from the local file system than via FTP/Gopher/HTTP/News, primarily because of all the UNIX forking required to gather information via these network processes. For example, our measurements indicate that gathering via FTP causes 4-7 times more CPU load than gathering directly from the local file system. For large collections (e.g., archive sites containing many thousands of files), the CPU savings can be considerable. Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to translate URLs to local file system names, using the <bf>Local-Mapping</bf> Gatherer configuration file variable (see Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file">). The syntax is: <tscreen><verb> Local-Mapping: URL_prefix local_path_prefix </verb></tscreen> This causes all URLs starting with <bf>URL_prefix</bf> to be translated to files starting with the prefix <bf>local_path_prefix</bf> while gathering, but to be left as URLs in the results of queries (so the objects can be retrieved as usual). Note that no regular expressions are supported here.
As an example, the specification <tscreen><verb> Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/ Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/ </verb></tscreen> would cause the URL <em>http://harvest.cs.colorado.edu/~hardy/Home.html</em> to be translated to the local file name <em>/homes/hardy/public_html/Home.html</em>, while the URL <em>ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z</em> would be translated to the local file name <em>/cs/ftp/techreports/schwartz/Harvest.Conf.ps.Z</em>. Local gathering will work over NFS file systems. A local mapping will fail if: the local file cannot be opened for reading; or the local file is not a regular file; or the local file has execute bits set. So, for directories, symbolic links and CGI scripts, the server is always contacted rather than the local file system. Lastly, the Gatherer does not perform any URL syntax translations for local mappings. If your URL has characters that should be escaped (as in <htmlurl url="http://www.ietf.org/rfc/rfc1738.txt" name="RFC1738">), then the local mapping will fail. Starting with version 1.4 patchlevel 2 Essence will print <em>[L]</em> after URLs which were successfully accessed locally. Note that if your network is highly congested, it may actually be faster to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very inefficient in highly congested situations. Even better would be to run local Gatherers on the hosts where the disks reside, and access them directly via the local file system. <sect2>Gathering from password-protected servers <p> You can gather password-protected documents from HTTP and FTP servers. In both cases, you can specify a username and password as a part of the URL. 
The format is as follows: <tscreen><verb>
ftp://user:password@host:port/url-path
http://user:password@host:port/url-path
</verb></tscreen> With this format, the ``user:password'' part is kept as a part of the URL string throughout Harvest. This may enable anyone who uses your Broker(s) to access password-protected documents. You can keep the username and password information ``hidden'' by specifying the authentication information in the Gatherer configuration file. For HTTP, the format is as follows: <tscreen><verb> HTTP-Basic-Auth: realm username password </verb></tscreen> where <bf>realm</bf> is the same as the <bf>AuthName</bf> parameter given in an Apache httpd <em>httpd.conf</em> or <em>.htaccess</em> file. In other httpd server configurations, the realm value is sometimes called <bf>ServerId</bf>. For FTP, the format in the gatherer.cf file is <tscreen><verb> FTP-Auth: hostname[:port] username password </verb></tscreen> <sect2>Controlling access to the Gatherer's database <label id="Controlling access to the Gatherer's database"> <p> You can use the <em>gatherd.cf</em> file (placed in the <bf>Data-Directory</bf> of a Gatherer) to control access to the Gatherer's database. A line that begins with <bf>Allow</bf> is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word <bf>all</bf> is used, then all hosts are matched. <bf>Deny</bf> is the opposite of <bf>Allow</bf>. The following example will only allow hosts in the <bf>cs.colorado.edu</bf> or <bf>usc.edu</bf> domains to access the Gatherer's database: <tscreen><verb>
Allow cs.colorado.edu usc.edu
Deny all
</verb></tscreen> <sect2>Periodic gathering and realtime updates <label id="Periodic gathering and realtime updates"> <p> The <tt>Gatherer</tt> program does not automatically do any periodic updates -- when you run it, it processes the specified URLs, starts up a <tt>gatherd</tt> daemon (if one isn't already running), and then exits.
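In practice, periodic gathering usually comes down to a <tt>cron</tt> entry that invokes a wrapper script such as <tt>RunGatherer</tt>. The schedule and path below are illustrative only; see crontab(5) for the field meanings:

```
# min hour day-of-month month day-of-week  command
0 3 * * 0  /usr/local/harvest/gatherers/att800/RunGatherer
```

This entry runs the Gatherer every Sunday at 03:00, consistent with the weekly-to-monthly interval recommended for the local disk cache.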
If you want to update the data periodically (e.g., to capture new files as they are added to an FTP archive), you need to use the UNIX <tt>cron</tt> command to run the <tt>Gatherer</tt> program at some regular interval. To set up periodic gathering via <tt>cron</tt>, use the <tt>RunGatherer</tt> command that <tt>RunHarvest</tt> will create. An example <tt>RunGatherer</tt> script follows: <tscreen><verb>
#!/bin/sh
#
#  RunGatherer - Runs the ATT 800 Gatherer (from cron)
#
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
export PATH
NNTPSERVER=localhost; export NNTPSERVER
cd /usr/local/harvest/gatherers/att800
exec Gatherer "att800.cf"
</verb></tscreen> You should run the <tt>RunGatherd</tt> command from your system startup (e.g. <em>/etc/rc.local</em>) file, so the Gatherer's database is exported each time the machine reboots. An example <tt>RunGatherd</tt> script follows: <tscreen><verb>
#!/bin/sh
#
#  RunGatherd - starts up the gatherd process (from /etc/rc.local)
#
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/bin:$PATH; export PATH
exec gatherd -d /usr/local/harvest/gatherers/att800/data 8500
</verb></tscreen> <sect2>The local disk cache <label id="The local disk cache"> <p> The Gatherer maintains a local disk cache of files it gathers to reduce network traffic from restarting aborted gathering attempts. However, since the remote server must still be contacted whenever <tt>Gatherer</tt> runs, please do not set your cron job to run <tt>Gatherer</tt> frequently. A typical interval might be weekly or monthly, depending on how congested the network is and how important it is to have the most current data. By default, the Gatherer's local disk cache is deleted after each successful completion.
To save the local disk cache between Gatherer sessions, define <bf>Keep-Cache: yes</bf> in your Gatherer configuration file (Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file">). If you want your Broker's index to reflect new data, then you must run the Gatherer <em>and</em> run a Broker collection. By default, a Broker will perform collections once a day. If you want the Broker to collect data as soon as it's gathered, then you will need to coordinate the timing of the completion of the Gatherer and the Broker collections. If you run your Gatherer frequently and use <bf>Keep-Cache: yes</bf> in your Gatherer configuration file, then the Gatherer's local disk cache may interfere with retrieving updates. By default, objects in the local disk cache expire after 7 days; however, you can expire objects more quickly by setting the <bf>$GATHERER_CACHE_TTL</bf> environment variable to the number of seconds for the Time-To-Live (TTL) before you run the Gatherer, or you can change <tt>RunGatherer</tt> to remove the Gatherer's <em>tmp</em> directory after each Gatherer run. For example, to expire objects in the local disk cache after one day: <tscreen><verb>
% setenv GATHERER_CACHE_TTL 86400   # one day
% ./RunGatherer
</verb></tscreen> The Gatherer's local disk cache size defaults to 32 MB, but you can change this value by setting the <bf>$HARVEST_MAX_LOCAL_CACHE</bf> environment variable to the desired size in megabytes before you run the Gatherer. For example, to have a maximum cache of 10 MB: <tscreen><verb>
% setenv HARVEST_MAX_LOCAL_CACHE 10   # 10 MB
% ./RunGatherer
</verb></tscreen> If you have access to the software that creates the files that you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify this software to schedule realtime Gatherer updates whenever a file is created or updated.
For example, if all users update the files being indexed using a particular program, this program could be modified to run the Gatherer upon completion of the user's update. Note that, when used in conjunction with <tt>cron</tt>, the Gatherer provides a powerful data ``mirroring'' facility. You can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes. <sect2>Incorporating manually generated information into a Gatherer <label id="Incorporating manually generated information into a Gatherer"> <p> You may want to inspect the quality of the automatically-generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section <ref id="Customizing the summarizing step" name="Customizing the summarizing step">). Sometimes, however, it makes sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add <em>Title</em> attributes to the content summaries for a set of PostScript documents (since it's difficult to parse them out of PostScript automatically). Harvest provides some programs that automatically clean up a Gatherer's database. The <tt>rmbinary</tt> program removes any binary data from the templates. The <tt>cleandb</tt> program does some simple validation of SOIF objects, and when given the <bf>-truncate</bf> flag it will truncate the <em>Keywords</em> data field to 8 kilobytes. 
To help in manually managing the Gatherer's databases, the <tt>gdbmutil</tt> GDBM database management tool is provided in <em>$HARVEST_HOME/lib/gatherer</em>. In a future release of Harvest we will provide a forms-based mechanism to make it easy to provide manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the <tt>mktemplate</tt>, <tt>template2db</tt>, <tt>mergedb</tt>, and <tt>mkindex</tt> programs. You first need to create a file (called, say, <em>annotations</em>) in the following format: <tscreen><verb>
@FILE { url1
Attribute-Name-1:	DATA
Attribute-Name-2:	DATA
...
Attribute-Name-n:	DATA
}
@FILE { url2
Attribute-Name-1:	DATA
Attribute-Name-2:	DATA
...
Attribute-Name-n:	DATA
}
...
</verb></tscreen> Note that the <em>Attributes</em> must begin in column 0 and have one tab after the colon, and the <em>DATA</em> must be on a single line. Next, run the <tt>mktemplate</tt> and <tt>template2db</tt> programs to generate SOIF and then GDBM versions of these data (you can have several files containing the annotations, and generate a single GDBM database from the above commands): <tscreen><verb>
% set path = ($HARVEST_HOME/lib/gatherer $path)
% mktemplate annotations [annotations2 ...] | template2db annotations.gdbm
</verb></tscreen> Finally, you run <tt>mergedb</tt> to incorporate the annotations into the automatically generated data, and <tt>mkindex</tt> to generate an index for it. The usage line for <tt>mergedb</tt> is: <tscreen><verb> mergedb production automatic manual [manual ...] </verb></tscreen> The idea is that <em>production</em> is the final GDBM database that the Gatherer will serve. This is a <em>new</em> database that will be generated from the other databases on the command line. <em>automatic</em> is the GDBM database that a Gatherer automatically generated in a previous run (e.g., <em>WORKING.gdbm</em> or a previous <em>PRODUCTION.gdbm</em>).
<em>manual</em> and so on are the GDBM databases that you manually created. When mergedb runs, it builds the <em>production</em> database by first copying the templates from the <em>manual</em> databases, and then merging in the attributes from the <em>automatic</em> database. In case of a conflict (the same attribute with different values in the <em>manual</em> and <em>automatic</em> databases), the <em>manual</em> values override the <em>automatic</em> values. By keeping the automatically and manually generated data stored separately, you can avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script to remerge the manual annotations with the automatically gathered data after each gathering. An example use of <tt>mergedb</tt> is: <tscreen><verb>
% mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
% mv PRODUCTION.new PRODUCTION.gdbm
% mkindex
</verb></tscreen> If the manual database looked like this: <tscreen><verb>
@FILE { url1
my-manual-attribute:	this is a neat attribute
}
</verb></tscreen> and the automatic database looked like this: <tscreen><verb>
@FILE { url1
keywords:	boulder colorado
file-size:	1034
md5:	c3d79dc037efd538ce50464089af2fb6
}
</verb></tscreen> then in the end, the production database will look like this: <tscreen><verb>
@FILE { url1
my-manual-attribute:	this is a neat attribute
keywords:	boulder colorado
file-size:	1034
md5:	c3d79dc037efd538ce50464089af2fb6
}
</verb></tscreen> <sect1>Troubleshooting <p> <descrip> <tag/Debugging/ Extra information from specific programs and library routines can be logged by setting debugging flags. A debugging flag has the form <bf>-Dsection,level</bf>. <em>Section</em> is an integer in the range 1-255, and <em>level</em> is an integer in the range 1-9. Debugging flags can be given on a command line, with the <bf>Debug-Options:</bf> tag in a gatherer configuration file, or by setting the environment variable <bf>$HARVEST_DEBUG</bf>.
Examples: <tscreen><verb>
Debug-Options: -D68,5 -D44,1
% httpenum -D20,1 -D21,1 -D42,1 http://harvest.cs.colorado.edu/
% setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1'
</verb></tscreen> Debugging sections and levels have been assigned to the following sections of the code: <tscreen><verb>
section 20, level 1, 5, 9    Common liburl URL processing
section 21, level 1, 5, 9    Common liburl HTTP routines
section 22, level 1, 5       Common liburl disk cache routines
section 23, level 1          Common liburl FTP routines
section 24, level 1          Common liburl Gopher routines
section 25, level 1          urlget - standalone liburl program
section 26, level 1          ftpget - standalone liburl program
section 40, level 1, 5, 9    Gatherer URL enumeration
section 41, level 1          Gatherer enumeration URL verification
section 42, level 1, 5, 9    Gatherer enumeration for HTTP
section 43, level 1, 5, 9    Gatherer enumeration for Gopher
section 44, level 1, 5       Gatherer enumeration filter routines
section 45, level 1          Gatherer enumeration for FTP
section 46, level 1          Gatherer enumeration for file:// URLs
section 48, level 1, 5       Gatherer enumeration robots.txt stuff
section 60, level 1          Gatherer essence data object processing
section 61, level 1          Gatherer essence database routines
section 62, level 1          Gatherer essence main
section 63, level 1          Gatherer essence type recognition
section 64, level 1          Gatherer essence object summarizing
section 65, level 1          Gatherer essence object unnesting
section 66, level 1, 2, 5    Gatherer essence post-summarizing
section 67, level 1          Gatherer essence object-ID code
section 69, level 1, 5, 9    Common SOIF template processing
section 70, level 1, 5, 9    Broker registry
section 71, level 1          Broker collection routines
section 72, level 1          Broker SOIF parsing routines
section 73, level 1, 5, 9    Broker registry hash tables
section 74, level 1          Broker storage manager routines
section 75, level 1, 5       Broker query manager routines
section 75, level 4          Broker query_list debugging
section 76, level 1          Broker event management routines
section 77, level 1          Broker main
section 78, level 9          Broker select(2) loop
section 79, level 1, 5, 9    Broker gatherer-id management
section 80, level 1          Common utilities memory management
section 81, level 1          Common utilities buffer routines
section 82, level 1          Common utilities system(3) routines
section 83, level 1          Common utilities pathname routines
section 84, level 1          Common utilities hostname processing
section 85, level 1          Common utilities string processing
section 86, level 1          Common utilities DNS host cache
section 101, level 1         Broker PLWeb indexing engine
section 102, level 1, 2, 5   Broker Glimpse indexing engine
section 103, level 1         Broker Swish indexing engine
</verb></tscreen> <tag/Symptom/ The Gatherer <em>doesn't pick up all the objects</em> pointed to by some of my RootNodes. <tag/Solution/ The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See section <ref id="RootNode specifications" name="RootNode specifications"> for details on how to override these limits. <tag/Symptom/ <em>Local-Mapping did not work</em> for me - it retrieved the objects via the usual remote access protocols. <tag/Solution/ A local mapping will fail if: <itemize> <item>the local filename cannot be opened for reading; or, <item>the local filename is not a regular file; or, <item>the local filename has execute bits set. </itemize> So for directories, symbolic links, and CGI scripts, the HTTP server is always contacted. We don't perform URL translation for local mappings. If your URLs have characters that must be escaped, then the local mapping will also fail. Add debug option <bf>-D20,1</bf> to understand how local mappings are taking place. <tag/Symptom/ Using the <bf>--full-text</bf> option I see a lot of <em>raw data</em> in the content summaries, with few keywords I can search. <tag/Solution/ At present <bf>--full-text</bf> simply includes the full data content in the SOIF summaries.
Using the individual file type summarizing mechanism described in Section <ref id="Customizing the summarizing step" name="Customizing the summarizing step"> will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence <bf>--full-text</bf> option to perform content extraction before including the full text of documents. <tag/Symptom/ No indexing terms are being generated in the SOIF summary for the META tags in my HTML documents. <tag/Solution/ This probably indicates that your HTML is not syntactically well-formed, and hence the SGML-based HTML summarizer is not able to recognize it. See Section <ref id="Summarizing SGML data" name="Summarizing SGML data"> for details and debugging options. <tag/Symptom/ Gathered data are <em>not being updated</em>. <tag/Solution/ The Gatherer does not automatically do periodic updates. See Section <ref id="Periodic gathering and realtime updates" name="Periodic gathering and realtime updates"> for details. <tag/Symptom/ The Gatherer puts <em>slightly different URLs</em> in the <em>SOIF</em> summaries than I specified in the Gatherer <em>configuration file</em>. <tag/Solution/ This happens because the Gatherer attempts to put URLs into a canonical format. It does this by removing default port numbers and making similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes the standard stoplist.cf types, which include HTTP-Query (the cgi-bin stuff). <tag/Symptom/ There are <em>no Last-Modification-Time</em> or <em>MD5 attributes</em> in my gathered SOIF data, so the Broker can't do duplicate elimination. <tag/Solution/ If you gather remote, manually-created information, it is pulled into Harvest using ``exploders'' that translate from the remote format into SOIF. That means they don't have a direct way to fill in the Last-Modification-Time or MD5 information per record.
Note also that this will mean one update to the remote records would cause all records to look updated, which will result in more network load for Brokers that collect from this Gatherer's data. As a solution, you can compute MD5s for all objects, and store them as part of the record. Then, when you run the exploder you only generate timestamps for the ones for which the MD5s changed - giving you real last-modification times. <tag/Symptom/ The Gatherer substitutes a ``%7e'' for a ``~'' in all the user directory URLs. <tag/Solution/ The Gatherer conforms to <htmlurl url="http://www.ietf.org/rfc/rfc1738.txt" name="RFC1738">, which says that a tilde inside a URL should be encoded as ``%7e'', because it is considered an ``unsafe'' character. <tag/Symptom/ When I search using keywords I know are in a document I have indexed with Harvest, the <em>document isn't found</em>. <tag/Solution/ Harvest uses a content extraction subsystem called <em>Essence</em> that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">. Or, you can tell <em>Essence</em> to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file">. <tag/Symptom/ I'm running Harvest on HP-UX, but the <tt>essence</tt> process in the Gatherer <em>takes too much memory</em>. <tag/Solution/ The supplied regular expression library has memory leaks on HP-UX, so you need to use the regular expression library supplied with HP-UX. 
Change the <em>Makefile</em> in <em>src/gatherer/essence</em> to read: <tscreen><verb> REGEX_DEFINE = -DUSE_POSIX_REGEX REGEX_INCLUDE = REGEX_OBJ = REGEX_TYPE = posix </verb></tscreen> <tag/Symptom/ I built the configuration files to <em>customize</em> how Essence types/content extracts data, but it <em>uses the standard typing/extracting</em> mechanisms anyway. <tag/Solution/ Verify that you have the <bf>Lib-Directory</bf> set to the <em>lib/</em> directory in which you put your configuration files. <bf>Lib-Directory</bf> is defined in your Gatherer configuration file. <tag/Symptom/ I am having problems <em>resolving host names</em> on SunOS. <tag/Solution/ In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either: <itemize> <item>the hostname you gave does not really exist; or <item>your system is not configured to use the DNS. </itemize> To verify that your system is configured for DNS, make sure that the file <em>/etc/resolv.conf</em> exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the <tt>nslookup</tt> command. Some sites may use Sun Microsystems' Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (the names of which you can determine from the <tt>ypwhich</tt> command) must be configured to query DNS servers for hostnames they do not know about. See the <bf>-b</bf> option of the <tt>ypxfr</tt> command. <tag/Symptom/ I cannot get the Gatherer to work across our <em>firewall gateway</em>. <tag/Solution/ Harvest only supports retrieving HTTP objects through a proxy. It is not yet possible to request Gopher and FTP objects through a firewall.
For these objects, you may need to run Harvest internally (behind the firewall) or on the firewall host itself. If you see the ``Host is unreachable'' message, these are the likely problems: <itemize> <item>your connection to the Internet is temporarily down due to a circuit or routing failure; or <item>you are behind a firewall. </itemize> If you see the ``Connection refused'' message, the likely problem is that you are trying to connect to an unused port on the destination machine. In other words, there is no program listening for connections on that port. The Harvest gatherer is essentially a WWW client. You should expect it to work the same as any Web browser. </descrip> <sect>The Broker <label id="The Broker"> <p> <sect1>Overview <p> The Broker retrieves and manages indexing information from Gatherers and other Brokers, and provides a WWW query interface to the indexing information. <sect1>Basic setup <p> The Broker is automatically started by the <tt>RunHarvest</tt> command. Other relevant commands are described in Section <ref id="Starting up the system: RunHarvest and related commands" name="Starting up the system: RunHarvest and related commands">. In the current section we discuss various ways users can customize and tune the Broker, how to administer the Broker, and the various Broker programming interfaces. As suggested in Figure <ref id="img1" name="1">, the Broker uses a flexible indexing interface that supports a variety of indexing subsystems. The default Harvest Broker uses <htmlurl url="http://webglimpse.org/gdocs.html" name="Glimpse"> as its indexer, but other indexers, such as Swish and WAIS (both <url url="ftp://ftp.cnidr.org/pub/software/freewais/" name="freeWAIS and commercial WAIS">), also work with the Broker (see Section <ref id="Using different index/search engines with the Broker" name="Using different index/search engines with the Broker">). To create a new Broker, run the <tt>CreateBroker</tt> program.
It will ask you a series of questions about how you'd like to configure your Broker, and then automatically create and configure it. To start your Broker, use the <tt>RunBroker</tt> program that <tt>CreateBroker</tt> generates. The Broker should be started when your system reboots. To prevent a collection while starting the broker, use the <bf>-nocol</bf> option. There are a number of ways you can customize or tune the Broker, discussed in Sections <ref id="Tuning Glimpse indexing in the Broker" name="Tuning Glimpse indexing in the Broker"> and <ref id="Using different index/search engines with the Broker" name="Using different index/search engines with the Broker">. You may also use the <tt>RunHarvest</tt> command, discussed in Section <ref id="Starting up the system: RunHarvest and related commands" name="Starting up the system: RunHarvest and related commands">, to create both a Broker and a Gatherer. <sect1>Querying a Broker <label id="Querying a Broker"> <p> The Harvest Broker can handle many types of queries. The queries handled by a particular Broker depend on what index/search engine is being used inside of it (e.g., WAIS does not support some of the queries that Glimpse does). In this section we describe the full syntax. If a particular Broker does not support a certain type of query, it will return an error when the user requests that type of query. The simplest query is a single keyword, such as: <tscreen><verb> lightbulb </verb></tscreen> Searching for common words (like ``computer'' or ``html'') may take a lot of time. Particularly for large Brokers, it is often helpful to use more powerful queries. Harvest supports many different index/search engines, with varying capabilities. 
At present, our most powerful (and commonly used) search engine is <htmlurl url="http://webglimpse.org/gdocs.html" name="Glimpse">, which supports: <itemize> <item>case-insensitive and case-sensitive queries; <item>matching parts of words, whole words, or multiple word phrases (like ``resource discovery''); <item>Boolean (AND/OR) combinations of keywords; <item>approximate matches (e.g., allowing spelling errors); <item>structured queries (which allow you to constrain matches to certain attributes); <item>displaying matched lines or entire matching records (e.g., for citations); <item>specifying limits on the number of matches returned; and <item>a limited form of regular expressions (e.g., allowing ``wild card'' expressions that match all words ending in a particular suffix). </itemize> The different types of queries (and how to use them) are discussed below. Note that you use the same syntax regardless of what index/search engine is running in a particular Broker, but that not all engines support all of the above features. In particular, some of the Brokers use WAIS, which sometimes searches faster than Glimpse but supports only Boolean keyword queries and the ability to specify result set limits. The different options - case-sensitivity, approximate matching, the ability to show matched lines vs. entire matching records, and the ability to specify match count limits - can all be specified with buttons and menus in the Broker query forms. A structured query has the form: <tscreen><verb> tag-name : value </verb></tscreen> where <em>tag-name</em> is a Content Summary attribute name, and <em>value</em> is the search value within the attribute. If you click on a Content Summary, you will see what attributes are available for a particular Broker. A list of common attributes is shown in Section <ref id="List of common SOIF attribute names" name="List of common SOIF attribute names">. 
Keyword searches and structured queries can be combined using Boolean operators (AND and OR) to form complex queries. In the absence of parentheses, Boolean operators are evaluated left to right. For multiple word phrases or regular expressions, you need to enclose the string in double quotes, e.g., <tscreen><verb> "internet resource discovery" </verb></tscreen> or <tscreen><verb> "discov.*" </verb></tscreen> Double quotes should also be used when searching for non-alphanumeric characters. <sect2>Example queries <p> <descrip> <tag/Simple keyword search query:/ <em>Arizona</em> This query returns all objects in the Broker containing the word <em>Arizona</em>. <tag/Boolean query:/ <em>Arizona AND desert</em> This query returns all objects in the Broker that contain both words anywhere in the object in any order. <tag/Phrase query:/ <em>"Arizona desert"</em> This query returns all objects in the Broker that contain <em>Arizona desert</em> as a phrase. Notice that you need to put double quotes around the phrase. <tag/Boolean queries with phrases:/ <em>"Arizona desert" AND windsurfing</em> This query returns all objects in the Broker that contain <em>Arizona desert</em> as a phrase and the word windsurfing. <tag/Simple Structured query:/ <em>Title : windsurfing</em> This query returns all objects in the Broker where the <em>Title</em> attribute contains the value <em>windsurfing</em>. <tag/Complex query:/ <em>"Arizona desert" AND (Title : windsurfing)</em> This query returns all objects in the Broker that contain the phrase <em>Arizona desert</em> and where the <em>Title</em> attribute of the same object contains the value <em>windsurfing</em>. </descrip> <sect2>Regular expressions <p> Some types of regular expressions are supported by Glimpse. A regular expression search can be much slower than other searches. The following is a partial list of possible patterns. (For more details see the <htmlurl url="http://webglimpse.org/gdocs.html" name="Glimpse documentation">.)
<itemize> <item><em>^joe</em> will match ``joe'' at the beginning of a line. <item><em>joe$</em> will match ``joe'' at the end of a line. <item><em>[a-ho-z]</em> matches any character between ``a'' and ``h'' or between ``o'' and ``z''. <item><em>.</em> matches any single character except newline. <item><em>c*</em> matches zero or more occurrences of the character ``c''. <item><em>.*</em> matches any number of characters except newline. <item><em>\*</em> matches the character ``*''. (<em>\</em> escapes any of the above special characters.) </itemize> Regular expressions are currently limited to approximately 30 characters, not including meta characters. Regular expressions will generally not cross word boundaries (because only words are stored in the index). So, for example, <em>"lin.*ing"</em> will find ``linking'' or ``flinching,'' but not ``linear programming.'' <sect2>Query options selected by menus or buttons <p> The query page may have the following checkboxes, which allow some control over the query specification. <descrip> <tag/Case insensitive:/ By selecting this checkbox the query will become case insensitive (lower case and upper case letters don't differ). Otherwise, the query will be case sensitive. The default is case insensitive. <tag/Keywords match on word boundaries:/ By selecting this checkbox, keywords will match on word boundaries. Otherwise, a keyword will match part of a word (or phrase). For example, "network" will match ``networking'', "sensitive" will match ``insensitive'', and "Arizona desert" will match ``Arizona desertness''. The default is to match keywords on word boundaries. <tag/Number of errors allowed:/ Glimpse allows the search to contain a number of errors. An error is either a deletion, insertion, or substitution of a single character. The Best Match option will find the match(es) with the fewest errors. The default is 0 (zero) errors. </descrip> <em>Note:</em> The previous three options do not apply to attribute names.
Attribute names are always case insensitive and allow no errors. <sect2>Filtering query results <p> Harvest allows you to filter the results of a query by any query term, using any attribute defined in the <ref id="List of common SOIF attribute names" name="List of common SOIF attribute names">. This is done by defining <bf>filter</bf> parameters in the query form. It is possible to define more than one filter parameter; they will be concatenated by boolean <bf>AND</bf>. Filter parameters consist of two parts, separated by the pipe symbol ``|''. The first part is a query expression which is attached to the user query using <bf>AND</bf> before sending the request to the broker. The optional second part is HTML text that is displayed on the results page, to give the user some information on the applied filter. Example: <tscreen><verb> <SELECT NAME="filter"> <OPTION VALUE=''>No Filter <OPTION VALUE='uri: "xyz\.edu"|Search only xyz.edu'>Search xyz.edu only <OPTION VALUE='type: html|HTML documents only'>Search HTML documents only </SELECT> </verb></tscreen> The first option returns an unfiltered output. The second option returns only pages with ``xyz.edu'' in their URL. The third option returns only HTML documents. See the advanced search page of the broker for more examples. <sect2>Result set presentation <p> The query page may have the following checkboxes, which allow some control over the presentation of query results. <descrip> <tag/Display matched lines (from content summaries):/ By selecting this checkbox, the result set presentation will contain the lines of the Content Summary that matched the query. Otherwise, the matched lines will not be displayed. The default is to display the matched lines. <tag/Display object descriptions (if available):/ Some objects have short, one-line descriptions associated with them. By selecting this checkbox, the descriptions will be presented. Otherwise, the object descriptions will not be displayed.
The default is to display object descriptions. <tag/Display links to indexed content summary:/ This checkbox allows you to set whether links to the indexed content summaries are displayed or not. The default is not to display links to indexed content summaries. </descrip> <sect1>Customizing the Broker's Query Result Set <p> It is possible for the Harvest administrator to customize how the Broker query result set is generated, by modifying a configuration file that is interpreted by the <tt>search.cgi</tt> Perl program at query result time. <tt>search.cgi</tt> allows you to customize almost every aspect of its HTML output. The file <em>$HARVEST_HOME/cgi-bin/lib/search.cf</em> contains the default output definitions. Individual brokers can be customized by creating a similar file which overrides the default definitions. <sect2>The search.cf configuration file <label id="The search.cf configuration file"> <p> Definitions are enclosed within SGML-like beginning and ending tags. For example: <tscreen><verb> <HarvestUrl> http://harvest.sourceforge.net/ </HarvestUrl> </verb></tscreen> The last newline character is removed from each definition, so that the above becomes the string ``http://harvest.sourceforge.net/.'' Variable substitution occurs on every definition before it is output. A number of specific variables are defined by <tt>search.cgi</tt> and can be used inside a definition. For example: <tscreen><verb> <BrokerLoad> Sorry, the Broker at <STRONG>$host, port $port</STRONG> is currently too heavily loaded to process your request. Please try again later.<P> </BrokerLoad> </verb></tscreen> When this definition is printed out, the variables <em>$host</em> and <em>$port</em> are replaced with the hostname and port of the broker. <sect3>Defined Variables <p> The following variables are defined as soon as the query string is processed. They can be used before the broker returns any results.
<tscreen><verb> $maxresult The maximum number of matched lines to be returned $host The broker hostname $port The broker port $query The query string entered by the user $bquery The whole query string sent to the broker </verb></tscreen> These variables are defined for each matched object returned by the broker. <tscreen><verb> $objectnum The number of the returned object $desc The description attribute of the matched object $opaque ALL the matched lines from the matched object $url The original URL of the matched object $A The access method of $url (e.g.: http) $H The hostname (including port) from $url $P The path part of $url $D The directory part of $P $F The filename part of $P $cs_url The URL of the content summary in the broker database $cs_a Access part of $cs_url $cs_h Hostname part of $cs_url $cs_p Path part of $cs_url $cs_d Directory part of $cs_p $cs_f Filename part of $cs_p </verb></tscreen> <sect3>List of Definitions <p> Below is a partial list of definitions. A complete list can be found in the search.cf file. Only definitions likely to be customized are described here. <descrip> <tag><bf><Timeout></bf></tag> Timeout value for <tt>search.cgi</tt>. If the broker doesn't respond within this time, <tt>search.cgi</tt> will exit. <tag><bf><ResultHeader></bf></tag> The first part of the result page. Should probably contain the HTML <bf><TITLE></bf> element and the user query string. <tag><bf><ResultTrailer></bf></tag> The last part of the result page. The default has URL references to the broker home page and the Harvest project home page. <tag><bf><ResultSetBegin></bf></tag> This is output just before looping over all the matched objects. <tag><bf><ResultSetEnd></bf></tag> This is output just after ending the loop over matched objects. <tag><bf><PrintObject></bf></tag> This definition prints out a matched object. It should probably include the variables <em>$url, $cs_url, $desc</em>, and <em>$opaque</em>. 
<tag><bf><EndBrokerResults></bf></tag> Printed between <bf><ResultSetEnd></bf> and <bf><ResultTrailer></bf> if the query was successful. Should probably include a count of matched objects and/or matched lines. <tag><bf><FailBrokerResults></bf></tag> Similar to <bf><EndBrokerResults></bf> but prints if the broker returns an error in response to the query. <tag><bf><ObjectNumPrintf></bf></tag> A <tt>printf</tt> format string for the object number (<em>$objectnum</em>). <tag><bf><TruncateWarning></bf></tag> Prints a warning message if the result set was truncated at the maximum number of matched lines. </descrip> The following definitions are somewhat different because they are evaluated as Perl instructions rather than strings. <descrip> <tag><bf><MatchedLineSub></bf></tag> Evaluated for every matched line returned by the broker. Can be used to indent matched lines or to remove the leading ``Matched line'' and attribute name strings. <tag><bf><InitFunction></bf></tag> Evaluated near the beginning of the <tt>search.cgi</tt> program. Can be used to set up special variables or read data files. <tag><bf><PerObjectFunction></bf></tag> Evaluated for each object just before <bf><PrintObject></bf> is called. <tag><bf><FormatAttribute></bf></tag> Evaluated for each SOIF attribute requested for matched objects (see Section <ref id="Displaying SOIF attributes in results" name="Displaying SOIF attributes in results">). <em>$att</em> is set to the attribute name, and <em>$val</em> is set to the attribute value. </descrip> <sect2>Example search.cf customization file <p> The following definitions demonstrate how to change the <tt>search.cgi</tt> output. The <bf><PerObjectFunction></bf> ensures that the description is not empty. It also prepends the string ``matched lines:'' before any matched lines. The <bf><PrintObject></bf> specification prints the object number, description, and indexing data all on the first line.
The description is wrapped in HTML anchor tags so that it is a link to the object originally gathered. The words ``indexing data'' are a link to the displaySOIF program which will format the content summary for HTML browsers. The object number is formatted as a number in parentheses such that the whole thing takes up four spaces. The <bf><MatchedLineSub></bf> definition includes four substitution expressions. The first removes the words ``Matched line:'' from the beginning of each matched line. The second removes SOIF attributes of the form ``<em>partial-text{43}:</em>'' from the beginning of a line. The third displays the attribute names (e.g. <em>partial-text#</em>) in italics. The last expression indents each line by five spaces to align it with the description line. The definition for <bf><EndBrokerResults></bf> slightly modifies the report of how many objects were matched. <tscreen><verb> # Demo to show some of the customization features for the Harvest output # More information can be found in the manual at: # http://harvest.sourceforge.net/harvest/doc/html/manual.html # The PerObjectFunction is Perl code evaluated for every hit <PerObjectFunction> # Create description # Is the description provided by Harvest very short (e.g. missing <TITLE>)?
if (length($desc) < 5) { # Yes: use filename ($F) instead $description = "<I>File:</I> $F"; } else { # No: use description provided by Harvest $description = $desc; } # Format matched lines ("opaque data") if data is present if ($opaque ne '') { $opaque = "<strong>matched lines:</strong><BR>$opaque" } </PerObjectFunction> # PrintObject defines the appearance of hits <PrintObject> $objectnum <A HREF="$url"><STRONG>$description</STRONG></A> \ [<A HREF="$cs_a://$cs_h/Harvest/cgi-bin/displaySOIF.cgi?object=$cs_p">\ indexing data</A>] <pre> $opaque </pre>\n </PrintObject> # Format the appearance of the hit number <ObjectNumPrintf> (%2d) </ObjectNumPrintf> # Format the appearance of every matched line <MatchedLineSub> s/^Matched line: *//; # Remove "Matched line:" s/^([\w-]+# )[\w-]+{\d+}:\t/\1/; # Remove SOIF attributes of the form "partial-text{43}:" s/^([\w-]+#)/<I>\1<\/I>/; # Format attribute names as italics s/^.*/ $&/; # Add spaces to indent text </MatchedLineSub> # Modifies the report of how many objects were matched <EndBrokerResults> <STRONG>Found $nopaquelines matched lines, $nobjects objects.</STRONG> <P>\n </EndBrokerResults> </verb></tscreen> <sect2>Integrating your customized configuration file <p> The <tt>search.cgi</tt> configuration files are kept in <em>$HARVEST_HOME/cgi-bin/lib</em>. The name of a customized file is listed in the <em>query.html</em> form, and passed as an option to the <tt>search.cgi</tt> program.
The simplest way to specify the customized file is by placing an <bf><INPUT></bf> tag in the HTML form: <tscreen><verb> <INPUT TYPE="hidden" NAME="brokerqueryconfig" VALUE="custom.cf"> </verb></tscreen> Another way is to allow users to select from different customizations with a <bf><SELECT></bf> list: <tscreen><verb> <SELECT NAME="brokerqueryconfig"> <OPTION VALUE=""> Default <OPTION VALUE="custom1.cf"> Customized <OPTION VALUE="custom2.cf" SELECTED> Highly Customized </SELECT> </verb></tscreen> <sect2>Displaying SOIF attributes in results <label id="Displaying SOIF attributes in results"> <p> It is possible to request SOIF attributes from the HTML query form. A simple approach is to include a select list in the query form: <tscreen><verb> <SELECT MULTIPLE NAME="attribute"> <OPTION VALUE="title"> <OPTION VALUE="author"> <OPTION VALUE="date"> <OPTION VALUE="subject"> </SELECT> </verb></tscreen> In this manner, the user may control which attributes get displayed. The layout of these attributes when the results are displayed in HTML is controlled by the <bf><FormatAttribute></bf> specification in the <em>search.cf</em> file described in Section <ref id="The search.cf configuration file" name="The search.cf configuration file">. <sect1>World Wide Web interface description <p> To allow Web browsers to easily interface with the Broker, we implemented a World Wide Web interface to the Broker's query manager and administrative interfaces. This WWW interface, which includes several HTML files and a few programs that use the <htmlurl url="http://hoohoo.ncsa.uiuc.edu/cgi/overview.html" name="Common Gateway Interface"> (CGI), consists of the following: <itemize> <item>HTML files that use <htmlurl url="http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/overview.html" name="Forms"> support to present a graphical user interface (GUI) to the user; <item>CGI programs that act as a gateway between the user and the Broker; and <item>Help files for the user. 
</itemize> Users go through the following steps when using a Broker to locate information: <enum> <item>The user issues a query to the Broker. <item>The Broker processes the query, and returns the query results to the user. <item>The user can then view content summaries from the result set, or access the URLs from the result set directly. </enum> To provide a WWW-queryable interface, the Broker needs to run in conjunction with an HTTP server. Section <ref id="Additional installation for the Harvest Broker" name="Additional installation for the Harvest Broker"> describes how to configure your HTTP server to work with Harvest. You can run the Broker on a different machine than your HTTP server runs on, but if you want users to be able to view the Broker's content summaries then the Broker's files will need to be accessible to your HTTP server. You can NFS mount those files or manually copy them over. You'll also need to change the <em>Brokers.cf</em> file to point to the host that is running the Broker. <sect2>HTML files for graphical user interface <p> <tt>CreateBroker</tt> creates some HTML files to provide GUIs to the user: <descrip> <tag><em>query.html</em></tag> Contains the GUI for the query interface. <tt>CreateBroker</tt> will install different <em>query.html</em> files for Glimpse, Swish, and WAIS, since each subsystem requires different defaults and supports different functionality (e.g., WAIS doesn't support approximate matching like Glimpse). This is also the ``home page'' for the Broker and a link to this page is included at the bottom of all query results. <tag><em>admin.html</em></tag> Contains the GUI for the administrative interface. This file is installed into the <em>admin</em> directory of the Broker. <tag><em>Brokers.cf</em></tag> Contains the hostname and port information for the supported brokers. This file is installed into the <em>$HARVEST_HOME/brokers</em> directory. 
The <em>query.html</em> file uses the value of the ``broker'' FORM tag to pass the name of the broker to <tt>search.cgi</tt> which in turn retrieves the host and port information from <em>Brokers.cf</em>. </descrip> <sect2>CGI programs <label id="CGI programs"> <p> When you install the WWW interface (see Section <ref id="The Broker" name="The Broker">), a few programs are installed into your HTTP server's <em>/Harvest/cgi-bin</em> directory: <descrip> <tag><tt>search.cgi</tt></tag> This program takes the submitted query from <em>query.html</em>, and sends it to the specified Broker. It then retrieves the query results from the Broker, formats them in HTML, and sends the result set in HTML to the user. <tag><tt>displaySOIF.cgi</tt></tag> This program displays the content summaries from the Broker. <tag><tt>BrokerAdmin.pl.cgi</tt></tag> This program will take the submitted administrative command from <em>admin.html</em> and send it to the appropriate Broker. It retrieves the result of the command from the Broker and displays it to the user. </descrip> <sect2>Help files for the user <p> The WWW interface to the Broker includes a few help files written in HTML. These files are installed on your HTTP server in the <em>/Harvest/brokers</em> directory when you install the broker (see Section <ref id="The Broker" name="The Broker">): <descrip> <tag><em>queryhelp.html</em></tag> Provides a tutorial on constructing Broker queries, and on using the <em>query.html</em> forms. <em>query.html</em> has a link to this help page. <tag><em>adminhelp.html</em></tag> Provides a tutorial on submitting Broker administrative commands using the <em>admin.html</em> form. <em>admin.html</em> has a link to this help page. <tag><em>soifhelp.html</em></tag> Provides a brief description of SOIF. 
</descrip> <sect1>Administrating a Broker <label id="Administrating a Broker"> <p> Administrators have two basic ways to manage a Broker: through the <em>broker.conf</em> and <em>Collection.conf</em> configuration files, and through the interactive administrative interface. The interactive interface controls various facilities and operating parameters within the Broker. We provide an HTML interface page for these administrative commands. See Section <ref id="Collector interface description: Collection.conf" name="Collector interface description: Collection.conf"> for additional information on the Broker administrative and collector interfaces. The <em>broker.conf</em> file is a list of variable names and their values, which records information about the Broker, such as the directory in which it lives and the port on which it runs. The <em>Collection.conf</em> file (see Section <ref id="Collector interface description: Collection.conf" name="Collector interface description: Collection.conf"> for an example) is a list of collection points from which the Broker collects its indexing information. The <tt>CreateBroker</tt> program automatically generates both of these configuration files. You can manually edit these files if needed. The <tt>CreateBroker</tt> program also creates the <em>admin.html</em> file, which is the WWW interface to the Broker's administrative commands. Note that all administrative commands require a password as defined in <em>broker.conf</em>. <em>Note:</em> Changes to the Broker configuration are not saved when the Broker is restarted. Permanent changes to the Broker configuration should be made by manually editing the <em>broker.conf</em> file. The administrative interface created by <tt>CreateBroker</tt> has the following window fields: <tscreen><verb> Command Select an administrative command. See below for a description of the commands. Parameters Specify parameters for those commands that need them.
Password The administrative password. Broker Host The host where the broker is running. Broker Port The port where the broker is listening. </verb></tscreen> The administrative interface created by <tt>CreateBroker</tt> supports the following commands: <descrip> <tag><bf>Add objects by file:</bf></tag> Add object(s) to the Broker. The parameter is a list of filenames that contain SOIF objects to be added to the Broker. <tag><bf>Close log:</bf></tag> Flush all accumulated log information and close the current log file. Causes the Broker to stop logging. No parameters. <tag><bf>Compress Registry:</bf></tag> Performs garbage collection on the Registry file. No parameters. <tag><bf>Delete expired objects:</bf></tag> Deletes any object from the Broker whose <em>Time-to-Live</em> has expired. No parameters. <tag><bf>Delete objects by query:</bf></tag> Deletes any object(s) that match the given query. The parameter is a query with the same syntax as user queries. Query flags are currently unsupported. <tag><bf>Delete objects by oid:</bf></tag> Deletes the object(s) identified by the given OID numbers. The parameter is a list of OID numbers. The OID numbers can be obtained by using the <tt>dumpregistry</tt> command. <tag><bf>Disable log type:</bf></tag> Disables logging information about a particular type of event. The parameter is an event type. See Enable log type for a list of events. <tag><bf>Enable log type:</bf></tag> Enables logging information about a particular type of event. The parameter is the name of an event type. Currently, event types are limited to the following: <tscreen><verb> Update Log updated objects. Delete Log deleted objects. Refresh Log refreshed objects. Query Log user queries. Query-Return Log objects returned from a query. Cleaned Log objects removed by the cleaner. Collection Log collection events. Admin Log administrative events. Admin-Return Log the results of administrative events. Bulk-Transfer Log bulk transfer events.
Bulk-Return Log objects sent by bulk transfers. Cleaner-On Log cleaning events. Compressing-Registry Log registry compression events. All Log all events. </verb></tscreen> <tag><bf>Flush log:</bf></tag> Flush all accumulated log information to the current log file. No parameters. <tag><bf>Generate statistics:</bf></tag> Generates some basic statistics about the Broker object database. No parameters. <tag><bf>Index changes:</bf></tag> Index only the objects that have been added recently. No parameters. <tag><bf>Index corpus:</bf></tag> Index the <em>entire</em> object database. No parameters. <tag><bf>Open log:</bf></tag> Open a new log file. If the file does not exist, create a new one. The parameter is the name (relative to the broker) of a file to use for logging. <tag><bf>Restart server:</bf></tag> Force the broker to reread the Registry and reindex the corpus. This does not actually kill the broker process. No parameters. <tag><bf>Rotate log file:</bf></tag> Rotates the current log file to LOG.YYYYMMDD. Opens a new log file. No parameters. <tag><bf>Set variable:</bf></tag> Sets the value of a broker configuration variable. Takes two parameters, the name of a configuration variable and the new value for the variable. The configuration variables that can be set are those that occur in the <em>broker.conf</em> file. The change is only valid until the broker process dies. <tag><bf>Shutdown server:</bf></tag> Cleanly shutdown the Broker. No parameters. <tag><bf>Start collection:</bf></tag> Perform collections. No parameters. <tag><bf>Delete older objects of duplicate URLs:</bf></tag> Occasionally a broker may end up with multiple summaries for individual URLs. This can happen when the Gatherer changes its description, hostname, or port number. Use this command to search the broker for duplicated URLs. When two objects with the same URL are found, the object with the least-recent timestamp is removed.
</descrip> <sect2>Deleting unwanted Broker objects <p> If you build a Broker and then decide not to index some of that data (e.g., you decide it would make sense to split it into two different Brokers, each targeted to a different community), you need to change the Gatherer's configuration file, rerun the Gatherer, and then let the old objects time out in the Broker (since the Broker and Gatherer maintain separate databases). If you want to clean out the Broker's data sooner than that, you can use the Broker's administrative interface in one of three ways: <enum> <item>Use the 'Delete objects by oid' command. This is only reasonable if you have a small number of objects to remove in the Broker. <item>Use the 'Delete objects by query' command. This might be the best option if, for example, you can construct a regular expression based on the URLs you want to remove. <item>Shut down the server, manually remove the Broker's <em>objects/*</em> files, and then restart the Broker. This is easiest, although if you have a large number of objects it will take longer to rebuild the index. A simple way to accomplish this is by ``rebooting'' the Broker by deleting all the current objects, and doing a full collection, as follows: <tscreen><verb> % mv objects objects.old % rm -rf objects.old & % broker ./admin/broker.conf -new </verb></tscreen> </enum> After removing objects, you should use the <em>Index corpus</em> command. <sect2>Command-line Administration <p> It is possible to perform administrative functions by using the <tt>brkclient</tt> program from the command-line and shell scripts. For example, to force a collection, run: <tscreen><verb> % brkclient localhost 8501 '#ADMIN #Password secret #collection' </verb></tscreen> See your broker's raw <em>admin.html</em> file for a complete list of administrative commands. 
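Since <tt>brkclient</tt> is an ordinary command-line program, routine maintenance can be scheduled from <tt>cron</tt>. The following sketch wraps the collection command shown above; the host, port, and password are placeholder values, and only the <tt>#collection</tt> token is taken from this manual -- check any other command token against your broker's raw <em>admin.html</em> file before using it.

```shell
#!/bin/sh
# Sketch of scripted Broker administration via brkclient.
# BROKER_HOST, BROKER_PORT, and PASSWORD are placeholder values.
BROKER_HOST=localhost
BROKER_PORT=8501
PASSWORD=secret

# With DRY_RUN set (the default here), print the command instead of
# contacting a broker; unset it to actually send the request.
DRY_RUN=${DRY_RUN-1}

run_admin() {
    # Build the '#ADMIN #Password <pw> <command>' string and send it.
    cmd="#ADMIN #Password $PASSWORD $1"
    if [ -n "$DRY_RUN" ]; then
        echo "brkclient $BROKER_HOST $BROKER_PORT '$cmd'"
    else
        brkclient "$BROKER_HOST" "$BROKER_PORT" "$cmd"
    fi
}

# Force a collection, as in the interactive example above.
run_admin '#collection'
```

Running the script as-is prints the <tt>brkclient</tt> invocation it would issue, which makes it easy to verify the command string before putting it in a crontab.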
<sect1>Tuning Glimpse indexing in the Broker <label id="Tuning Glimpse indexing in the Broker"> <p> The Glimpse indexing system can be tuned in a variety of ways to suit your particular needs. Probably the most noteworthy parameter is indexing granularity, for which Glimpse provides three options: a tiny index (2-3% of the total size of all files -- your mileage may vary), a small index (7-8%), and a medium-size index (20-30%). Search times are better with larger indexes. By changing the <bf>GlimpseIndex-Option</bf> in your Broker's <em>broker.conf</em> file, you can tune Glimpse to use one of these three indexing granularity options. By default, <bf>GlimpseIndex-Option</bf> builds a medium-size index using the <tt>glimpseindex</tt> program. Note also that with Glimpse it is much faster to search with ``show matched lines'' turned off in the Broker query page. Glimpse uses a ``stop list'' to avoid indexing very common words. This list is not fixed, but rather computed as the index is built. For a medium-size index, the default is to put any word that appears at least 500 times per Mbyte (on the average) in the stop-list. For a small-size index, the default is words that appear in at least 80% of all files (unless there are fewer than 256 files, in which case there is no stop-list). Both defaults can be changed using the <bf>-S</bf> option, which should be followed by the new number (average per Mbyte when <bf>-b</bf> indexing is used, or % of files when <bf>-o</bf> indexing is used). Tiny indexes do not maintain a stop-list (a stop-list would have minimal effect there). <tt>glimpseindex</tt> includes a number of other options that may be of interest. You can find out more about these options (and more about Glimpse in general) in the <htmlurl url="http://webglimpse.org/gdocs.html" name="Glimpse documentation">. 
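For example, the granularity can be selected by editing <bf>GlimpseIndex-Option</bf> directly. The fragment below is a sketch only: it assumes the <tt>Variable: value</tt> layout that <tt>CreateBroker</tt> writes into <em>broker.conf</em>, and the flag meanings (no flag for tiny, <tt>-o</tt> for small, <tt>-b</tt> for medium) should be verified against your installed <tt>glimpseindex</tt> documentation.

```
# Sketch of a broker.conf granularity setting (verify against your file):
#   (no flag)  tiny index, 2-3% of data size
#   -o         small index, 7-8%
#   -b         medium index, 20-30% (the Harvest default)
GlimpseIndex-Option: -o

# A hypothetical stop-list tweak: with -o indexing, -S takes a
# percentage of files, e.g. 90 instead of the default 80.
# GlimpseIndex-Option: -o -S 90
```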
If you'd like to change how the Broker invokes the <tt>glimpseindex</tt> program, then edit the <em>src/broker/Glimpse/index.c</em> file from the Harvest source distribution. <sect2>The glimpseserver program <p> The Glimpse system comes with an auxiliary server called <tt>glimpseserver</tt>, which allows indexes to be read into a process and kept in memory. This avoids the added cost of reading the index and starting a large process for each search. <tt>glimpseserver</tt> is automatically started each time you run the Broker, or reindex the Broker's corpus. If you do not want to run <tt>glimpseserver</tt>, then set <bf>GlimpseServer-Host</bf> to ``false'' in your <em>broker.conf</em>. <sect1>Using different index/search engines with the Broker <label id="Using different index/search engines with the Broker"> <p> By default, Harvest uses the Glimpse index/search subsystem. However, Harvest defines a flexible indexing interface, to allow Broker administrators to use different index/search subsystems to accommodate domain-specific requirements. For example, it might be useful to provide a relational database back-end. At present we distribute code to support an interface to both the free and the commercial WAIS index/search engines, Glimpse, and Swish. Below we discuss how to use other index/search engines instead of Glimpse in the Broker, and provide some brief discussion of how to integrate a new index/search engine into the Broker. <sect2>Using Swish as an indexer <p> Harvest includes support for using Swish as the indexing engine with the Broker. Swish is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features. It is also an alternative in cases of trouble with Glimpse's copyright status. To use Swish with an existing Broker, you need to change the <em>Indexer-Type</em> variable in <em>broker.conf</em> to ``Swish''. 
You can also specify that you want to use Swish for a Broker, when you use the <tt>RunHarvest</tt> command by running: <tt>RunHarvest -swish</tt>. <sect2>Using WAIS as an indexer <p> Support for using WAIS (both freeWAIS and WAIS Inc.'s index/search engine) as the Broker's indexing and search subsystem is included in the Harvest distribution. WAIS is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features. To use WAIS with an existing Broker, you need to change the <em>Indexer-Type</em> variable in <em>broker.conf</em> to ``WAIS''; you can choose among the WAIS variants by setting the <em>WAIS-Flavor</em> variable in <em>broker.conf</em> to ``Commercial-WAIS'', ``freeWAIS'', or ``WAIS''. Otherwise, <tt>CreateBroker</tt> will ask you if you want to use WAIS, and where the WAIS programs (<tt>waisindex</tt>, <tt>waissearch</tt>, <tt>waisserver</tt>, and with the commercial version of WAIS <tt>waisparse</tt>) are located. When you run the Broker, a WAIS server will be started automatically after the index is built. You can also specify that you want to use WAIS for a Broker, when you use the <tt>RunHarvest</tt> command by running: <tt>RunHarvest -wais</tt>. <sect1>Collector interface description: Collection.conf <label id="Collector interface description: Collection.conf"> <p> The Broker retrieves indexing information from Gatherers or other Brokers through its <em>Collector</em> interface. A list of collection points is specified in the <em>admin/Collection.conf</em> configuration file. This file contains a collection point on each line, with 4 fields. The first field is the host of the remote Gatherer or Broker, the second field is the port number on that host, the third field is the collection type, and the fourth field is the query filter or <bf>--</bf> if there is no filter. 
The Broker supports various types of collections as described below: <tscreen><verb> Type Remote Process Description Compression? -------------------------------------------------------- 0 Gatherer Full collection each time No 1 Gatherer Incremental collections No 2 Gatherer Full collection each time Yes 3 Gatherer Incremental collections Yes 4 Broker Full collection each time No 5 Broker Incremental collections No 6 Broker Collection based on a query No 7 Broker Incremental based on a query No </verb></tscreen> The query filter specification for collection types 6 and 7 contains two parts: the <bf>--QUERY keywords</bf> portion and an optional <bf>--FLAGS flags</bf> portion. The <bf>--QUERY</bf> portion is passed on to the Broker as the keywords for the query (the keywords can be any Boolean and/or structured query); the <bf>--FLAGS</bf> portion is passed on to the Broker as the indexer-specific flags to the query. The following table shows the valid indexer-specific flags for the supported indexers: <tscreen><verb> Indexer Flag Description ----------------------------------------------------------------------------- All: #desc Show Description Lines Glimpse: #index case insensitive Case Insensitive #index case sensitive Case sensitive #index error number Allow "number" errors #index matchword Matches on word boundaries #index maxresult number Allow max of "number" results #opaque Show matched lines Wais: #index maxresult number Allow max of "number" results #opaque Show scores and rankings </verb></tscreen> The following is an example <em>Collection.conf</em>, which collects information from 2 Gatherers (one using compressed incremental collections, the other uncompressed full collections), and collects information from 3 Brokers (one incrementally based on a timestamp, and the others using query filters): <tscreen><verb> gatherer-host1.foo.com 8500 3 -- gatherer-host2.foo.com 8500 0 -- broker-host1.foo.com 8501 5 -- broker-host2.foo.com 8501 6 --QUERY (URL : document) AND gnu 
broker-host3.foo.com 8501 7 --QUERY Harvest --FLAGS #index case sensitive </verb></tscreen> <sect1>Troubleshooting <p> <descrip> <tag/Symptom/ The Broker is running but always returns <em>empty query results</em>. <tag/Solution/ Look in the <em>broker.out</em> file in the Broker's directory for error messages. If your Broker didn't index the data, use the administrative interface to force the Broker to build the index (see Section <ref id="Administrating a Broker" name="Administrating a Broker">). <tag/Symptom/ When I query my Broker, I get a "500 Server Error". <tag/Solution/ Generally, the ``500'' errors are related to a CGI program not working correctly or a misconfigured httpd server. Make sure that the userid running the HTTP server has access to the Harvest cgi-bin directory and the Perl include files in <em>$HARVEST_HOME/lib</em>. Refer to Section <ref id="Additional installation for the Harvest Broker" name="Additional installation for the Harvest Broker"> for further details. <tag/Symptom/ I see <em>duplicate documents</em> in my Broker. <tag/Solution/ The Broker performs duplicate elimination based on a combination of MD5 checksums and the Gatherer-Host, Gatherer-Name, and Gatherer-Version attributes. Therefore, you can end up with duplicate documents if your Broker collects from more than one Gatherer, each of which gathers from (a subset of) the same URLs. (As an aside, the reason for this notion of duplicate elimination is to allow a single Broker to contain several different SOIF objects for the same URL, but summarized in different ways.) Two solutions to the problem are: <enum> <item>Run your Gatherers on the same host. <item>Remove the duplicate URLs in a customized version of the <tt>search.cgi</tt> program by doing a string comparison of the URLs. </enum> <tag/Symptom/ The Broker takes a <em>long time</em> and does not answer queries. <tag/Solution/ Some queries are quite expensive, because they involve a great deal of I/O. 
For this reason we modified the Broker so that if a query takes longer than 5 minutes, the query process is killed. The best solution is to use a less expensive query, for example by using less common keywords. <tag/Symptom/ Some of the <em>query options</em> (such as structured or case sensitive queries) <em>aren't working</em>. <tag/Solution/ This usually means you are using an index/search engine that does not support structured queries (like the current Harvest support for commercial WAIS). If you are setting up your own Broker (rather than using someone else's Broker), see Section <ref id="Using different index/search engines with the Broker" name="Using different index/search engines with the Broker"> for details on how to switch to other index/search engines. Or, it could be that your <tt>search.cgi</tt> program is an old version and should be updated. <tag/Symptom/ I get <em>syntax errors</em> when I specify queries. <tag/Solution/ Usually this means you did not use double quotes where needed. See Section <ref id="Querying a Broker" name="Querying a Broker">. <tag/Symptom/ When I submit a query, I get an <em>answer faster than I can believe</em> it takes to perform the query, and the answer contains <em>garbage data</em>. <tag/Solution/ This probably indicates that your <tt>httpd</tt> is misconfigured. A common case is not putting the 'ScriptAlias' before the 'Alias' in your <em>conf/httpd.conf</em> file, when running the Apache <tt>httpd</tt>. See Section <ref id="Additional installation for the Harvest Broker" name="Additional installation for the Harvest Broker">. <tag/Symptom/ When I make <em>changes</em> to the Broker configuration via the <em>administration interface</em>, they are <em>lost</em> after the Broker is restarted. <tag/Solution/ The Broker administration interface does not save changes across sessions. Permanent changes to the Broker configuration should be done through the <em>broker.conf</em> file. 
<tag/Symptom/ My Broker is <em>running very slowly</em>. <tag/Solution/ Performance tuning can be complicated, but the most likely problem is that you are running on a machine with insufficient RAM, and paging a lot because the query engine kicks pages out in order to access the needed index and data files. (In UNIX the disk buffer cache competes with program and data pages for memory.) A simple way to tell is to run ``vmstat 5'' in one window, and after a couple of lines of output, issue a query from another window. This will print a line of measurements about the virtual memory status of your machine every 5 seconds. In particular, look at the ``pi'' and ``po'' columns. If the numbers suddenly jump into the 500-1,000 range after you issue the query, you are paging a lot. Note that paging problems are accentuated by running simultaneous memory-intensive or disk I/O-intensive programs on your machine. Simultaneous queries to a single Broker should not cause a paging problem, because the Broker processes the queries sequentially. It is best to run Brokers on an otherwise mostly unused machine with at least 128 MB of RAM (or more, if the above ``vmstat'' experiment indicates you are paging a lot). One other performance enhancer is to run an <em>httpd-accelerator</em> on your Broker machine, to intercept queries headed for your Broker. While it will not cache the results of queries, it will reduce load on the machine because it provides a very efficient means of returning results in the case of concurrent queries. Without the accelerator the results are sent back by a <tt>search.cgi</tt> UNIX process per query, and inefficiently time sliced by the UNIX kernel. With an accelerator the <tt>search.cgi</tt> processes exit quickly, and let the accelerator send the results back to the concurrent users. The accelerator will also reduce load for (non-query) retrievals of data from your httpd server. 
</descrip> <sect>Programs and layout of the installed Harvest software <label id="Programs and layout of the installed Harvest software"> <p> <sect1>$HARVEST_HOME <p> The top-level directory where you installed Harvest is known as <em>$HARVEST_HOME</em>. By default, <em>$HARVEST_HOME</em> is <em>/usr/local/harvest</em>. The following files and directories are located in <em>$HARVEST_HOME</em>: <tscreen><verb> RunHarvest* brokers/ gatherers/ tmp/ bin/ cgi-bin/ lib/ </verb></tscreen> <tt>RunHarvest</tt> is the script used to create and run Harvest servers (see Section <ref id="Starting up the system: RunHarvest and related commands" name="Starting up the system: RunHarvest and related commands">). <tt>RunHarvest</tt> has the same command line syntax as <tt>Harvest</tt>. <sect1>$HARVEST_HOME/bin <p> The <em>$HARVEST_HOME/bin</em> directory only contains programs that users would normally run directly. All other programs (e.g., individual summarizers for the Gatherer) as well as Perl library code are in the <em>lib</em> directory. The <em>bin</em> directory contains the following programs: <descrip> <tag><tt>CreateBroker</tt></tag> Creates a Broker. Usage: <tt>CreateBroker [skeleton-tree [destination]]</tt> <tag><tt>Gatherer</tt></tag> Main user interface to the Gatherer. This program is run by the <tt>RunGatherer</tt> script found in a Gatherer's directory. Usage: <tt>Gatherer [-manual|-export|-debug] file.cf</tt> <tag><tt>Harvest</tt></tag> The program used by <tt>RunHarvest</tt> to create and run Harvest servers as per the user's description. Usage: <tt>Harvest [flags]</tt> where flags can be any of the following: <tscreen><verb> -novice Simplest Q&A. Mostly uses the defaults. -glimpse Use Glimpse for the Broker. (default) -swish Use Swish for the Broker. -wais Use WAIS for the Broker. -dumbtty Dumb TTY mode. -debug Debug mode. -dont-run Don't run the Broker or the Gatherer. -fake Doesn't build the Harvest servers. -protect Don't change the umask. 
</verb></tscreen> <tag><tt>broker</tt></tag> The Broker program. This program is run by the <tt>RunBroker</tt> script found in a Broker's directory. Logs messages to both <em>broker.out</em> and to <em>admin/LOG</em>. Usage: <tt>broker [broker.conf file] [-nocol]</tt> <tag><tt>gather</tt></tag> The client interface to the Gatherer. Usage: <tt>gather [-info] [-nocompress] host port [timestamp]</tt> </descrip> <sect1>$HARVEST_HOME/brokers <p> The <em>$HARVEST_HOME/brokers</em> directory contains images and logos in the <em>images</em> directory, some basic tutorial HTML pages, and the skeleton files that <tt>CreateBroker</tt> uses to construct new Brokers. You can change the default values in these created Brokers by editing the files in <em>skeleton</em>. <sect1>$HARVEST_HOME/cgi-bin <p> The <em>$HARVEST_HOME/cgi-bin</em> directory contains the programs needed for the WWW interface to the Broker (described in Section <ref id="CGI programs" name="CGI programs">) and configuration files for <tt>search.cgi</tt> in the <em>lib</em> directory. <sect1>$HARVEST_HOME/gatherers <p> The <em>$HARVEST_HOME/gatherers</em> directory contains example Gatherers discussed in Section <ref id="Gatherer Examples" name="Gatherer Examples">. <tt>RunHarvest</tt>, by default, will create the new Gatherer in this directory. <sect1>$HARVEST_HOME/lib <p> The <em>$HARVEST_HOME/lib</em> directory contains a number of Perl library routines and other programs needed by various parts of Harvest, as follows: <descrip> <tag><em>chat2.pl, ftp.pl, socket.ph</em></tag> Perl libraries used to communicate with remote FTP servers. <tag><em>dateconv.pl, lsparse.pl, timelocal.pl</em></tag> Perl libraries used to parse <tt>ls</tt> output. <tag><tt>ftpget</tt></tag> Program used to retrieve files and directories from FTP servers. Usage: <tt>ftpget [-htmlify] localfile hostname filename A,I username password</tt> <tag><tt>gopherget.pl</tt></tag> Perl program used to retrieve files and menus from Gopher servers. 
Usage: <tt>gopherget.pl localfile hostname port command</tt> <tag><tt>harvest-check.pl</tt></tag> Perl program to check whether gatherers and brokers are up. Usage: <tt>harvest-check.pl [-v]</tt> <tag><tt>md5</tt></tag> Program used to compute MD5 checksums. Usage: <tt>md5 file [...]</tt> <tag><tt>newsget.pl</tt></tag> Perl program used to retrieve USENET articles and group summaries from NNTP servers. Usage: <tt>newsget.pl localfile news-URL</tt> <tag><em>soif.pl, soif-mem-efficient.pl</em></tag> Perl library used to process SOIF. <tag><tt>urlget</tt></tag> Program used to retrieve a URL. Usage: <tt>urlget URL</tt> <tag><tt>urlpurge</tt></tag> Program to purge the local disk URL cache used by <tt>urlget</tt> and the Gatherer. Usage: <tt>urlpurge</tt> </descrip> <sect1>$HARVEST_HOME/lib/broker <p> The <em>$HARVEST_HOME/lib/broker</em> directory contains the search and index programs needed by the Broker, plus several utility programs needed for Broker administration, as follows: <descrip> <tag><tt>BrokerRestart</tt></tag> This program will issue a restart command to a broker. Usage: <tt>BrokerRestart [-password passwd] host port</tt> <tag><tt>brkclient</tt></tag> Client interface to the broker. Can be used to send queries or administrative commands to a broker. Usage: <tt>brkclient hostname port command-string</tt> <tag><tt>dumpregistry</tt></tag> Prints the Broker's Registry file in a human-readable format. Usage: <tt>dumpregistry [-count] [BrokerDirectory]</tt> <tag><tt>agrep, glimpse, glimpseindex, glimpseindex.bin, glimpseserver</tt></tag> The Glimpse indexing and search system as described in Section <ref id="The Broker" name="The Broker">. <tag><tt>swish</tt></tag> The Swish indexing and search program as an alternative to Glimpse. <tag><tt>info-to-html.pl, mkbrokerstats.pl</tt></tag> Perl programs used to generate Broker statistics and to create <em>stats.html</em>. 
Usage: <tt>gather -info host port | info-to-html.pl > host.port.html</tt> Usage: <tt>mkbrokerstats.pl broker-dir > stats.html</tt> </descrip> <sect1>$HARVEST_HOME/lib/gatherer <p> The <em>$HARVEST_HOME/lib/gatherer</em> directory contains the default summarizers described in Section <ref id="Extracting data for indexing: The Essence summarizing subsystem" name="Extracting data for indexing: The Essence summarizing subsystem">, plus various utility programs needed by the summarizers and the Gatherer, as follows: <descrip> <tag><em>URL-filter-default</em></tag> Default URL filter as described in Section <ref id="RootNode specifications" name="RootNode specifications">. <tag><em>bycontent.cf, byname.cf, byurl.cf, magic, stoplist.cf, quick-sum.cf</em></tag> Essence configuration files as described in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">. <tag><tt>*.sum</tt></tag> Essence summarizers as discussed in Section <ref id="Extracting data for indexing: The Essence summarizing subsystem" name="Extracting data for indexing: The Essence summarizing subsystem">. <tag><tt>HTML-sum.pl</tt></tag> Alternative HTML summarizer written in Perl. <tag><tt>HTMLurls</tt></tag> Program to extract URLs from an HTML file. Usage: <tt>HTMLurls [--base-url url] filename</tt> <tag><tt>catdoc, xls2csv,</tt> <em>catdoc-lib</em></tag> Programs and files used by the Microsoft Word summarizer. <tag><tt>dvi2tty, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim</tt></tag> Programs used by various summarizers. <tag><tt>gifinfo</tt></tag> Program to support summarizers. <tag><tt>l2h</tt></tag> Program used by the TeX summarizer. <tag><tt>rast, sgmls, sgmlsasp,</tt> <em>sgmls-lib</em></tag> Programs and files used by the SGML summarizer. <tag><tt>rtf2html</tt></tag> Program used by the RTF summarizer. 
<tag><tt>wp2x, wp2x.sh,</tt> <em>wp2x-lib</em></tag> Programs and files used by the WordPerfect summarizer. <tag><tt>hexbin, unshar, uudecode</tt></tag> Programs used to unnest nested objects. <tag><tt>cksoif</tt></tag> Program used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors). Usage: <tt>cksoif < INPUT.soif</tt> <tag><tt>cleandb, consoldb, expiredb, folddb, mergedb, mkgathererstats.pl, mkindex, rmbinary</tt></tag> Programs used to prepare a Gatherer's database to be exported by <tt>gatherd</tt>. <tt>cleandb</tt> ensures that all SOIF objects are valid, and deletes any that are not; <tt>consoldb</tt> will consolidate n GDBM database files into a single GDBM database file; <tt>expiredb</tt> deletes any SOIF objects that are no longer valid as defined by their <em>Time-to-Live</em> attributes; <tt>folddb</tt> runs all of the operations needed to prepare the Gatherer's database for export by <tt>gatherd</tt>; <tt>mergedb</tt> consolidates GDBM files as described in Section <ref id="Incorporating manually generated information into a Gatherer" name="Incorporating manually generated information into a Gatherer">; <tt>mkgathererstats.pl</tt> generates the <em>INFO.soif</em> statistics file; <tt>mkindex</tt> generates the cache of timestamps; and <tt>rmbinary</tt> removes binary data from a GDBM database. <tag><tt>enum, prepurls, staturl</tt></tag> Programs used by <tt>Gatherer</tt> to perform the RootNode and LeafNode enumeration for the Gatherer as described in Section <ref id="RootNode specifications" name="RootNode specifications">. <tt>enum</tt> performs a RootNode enumeration on the given URLs; <tt>prepurls</tt> is a wrapper program used to pipe <tt>Gatherer</tt> and <tt>essence</tt> together; <tt>staturl</tt> retrieves LeafNode URLs to determine whether each URL has been modified. 
<tag><tt>fileenum, ftpenum, ftpenum.pl, gopherenum-*, httpenum-*, newsenum</tt></tag> Programs used by <tt>enum</tt> to perform protocol-specific enumeration. <tt>fileenum</tt> performs a RootNode enumeration on ``file'' URLs; <tt>ftpenum</tt> calls <tt>ftpenum.pl</tt> to perform a RootNode enumeration on ``ftp'' URLs; <tt>gopherenum-breadth</tt> performs a breadth-first RootNode enumeration on ``gopher'' URLs; <tt>gopherenum-depth</tt> performs a depth-first RootNode enumeration on ``gopher'' URLs; <tt>httpenum-breadth</tt> performs a breadth-first RootNode enumeration on ``http'' URLs; <tt>httpenum-depth</tt> performs a depth-first RootNode enumeration on ``http'' URLs; and <tt>newsenum</tt> performs a RootNode enumeration on ``news'' URLs. <tag><tt>essence</tt></tag> The Essence content extraction system as described in Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">. 
Usage: <tt>essence [options] -f input-URLs</tt> or <tt>essence [options] URL ...</tt> where options are: <tscreen><verb> --dbdir directory Directory to place database --full-text Use entire file instead of summarizing --gatherer-host Gatherer-Host value --gatherer-name Gatherer-Name value --gatherer-version Gatherer-Version value --help Print usage information --libdir directory Directory to place configuration files --log logfile Name of the file to log messages to --max-deletions n Number of GDBM deletions before reorganization --minimal-bookkeeping Generates a minimal amount of bookkeeping attrs --no-access Do not read contents of objects --no-keywords Do not automatically generate keywords --allowlist filename File with list of types to allow --stoplist filename File with list of types to remove --tmpdir directory Name of directory to use for temporary files --type-only Only type data; do not summarize objects --verbose Verbose output --version Version information </verb></tscreen> <tag><tt>print-attr</tt></tag> Reads in a SOIF stream from stdin and prints the data associated with the given attribute to stdout. Usage: <tt>cat SOIF-file | print-attr Attribute</tt> <tag><tt>gatherd, in.gatherd</tt></tag> Daemons that export the Gatherer's database. <tt>in.gatherd</tt> is used to run this daemon from inetd. Usage: <tt>gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port</tt> Usage: <tt>in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]</tt> <tag><tt>gdbmutil</tt></tag> Program to perform various operations on a GDBM database. <tscreen><verb> Usage: gdbmutil consolidate [-d | -D] master-file file [file ...] 
Usage: gdbmutil delete file key Usage: gdbmutil dump file Usage: gdbmutil fetch file key Usage: gdbmutil keys file Usage: gdbmutil print [-gatherd] file Usage: gdbmutil reorganize file Usage: gdbmutil restore file Usage: gdbmutil sort file Usage: gdbmutil stats file Usage: gdbmutil store file key < data </verb></tscreen> <tag><tt>mktemplate</tt></tag> Program to generate valid SOIF based on a more easily editable SOIF-like format (e.g., SOIF without the byte counts). Usage: <tt>mktemplate < INPUT.txt > OUTPUT.soif</tt> <tag><tt>quick-sum</tt></tag> Simple Perl program to emulate Essence's <em>quick-sum.cf</em> processing for those who cannot compile Essence with the corresponding C code. <tag><tt>template2db</tt></tag> Converts a stream of SOIF objects (from stdin or given files) into a GDBM database. Usage: <tt>template2db database [tmpl tmpl...]</tt> <tag><tt>wrapit</tt></tag> Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF. Usage: <tt>wrapit [Attribute]</tt> <tag><tt>kill-gatherd</tt></tag> Script to kill the <tt>gatherd</tt> process. </descrip> <sect1>$HARVEST_HOME/tmp <p> The <em>$HARVEST_HOME/tmp</em> directory is used by <tt>search.cgi</tt> to store search result pages. <sect>The Summary Object Interchange Format (SOIF) <label id="The Summary Object Interchange Format (SOIF)"> <p> Harvest Gatherers and Brokers communicate using an attribute-value stream protocol called the <em>Summary Object Interchange Format (SOIF)</em>, an example of which is available in Section <ref id="Example 1" name="Example 1">. Gatherers generate content summaries for individual objects in SOIF, and serve these summaries to Brokers that wish to collect and index them. SOIF provides a means of bracketing collections of summary objects, allowing Harvest Brokers to retrieve SOIF content summaries from a Gatherer for many objects in a single, efficient compressed stream. 
Harvest Brokers provide support for querying SOIF data using structured attribute-value queries and many other types of queries, as discussed in Section <ref id="Querying a Broker" name="Querying a Broker">. <sect1>Formal description of SOIF <p> The SOIF Grammar is as follows: <tscreen><verb> SOIF ::= OBJECT SOIF | OBJECT OBJECT ::= @ TEMPLATE-TYPE { URL ATTRIBUTE-LIST } ATTRIBUTE-LIST ::= ATTRIBUTE ATTRIBUTE-LIST | ATTRIBUTE ATTRIBUTE ::= IDENTIFIER {VALUE-SIZE} DELIMITER VALUE TEMPLATE-TYPE ::= Alpha-Numeric-String IDENTIFIER ::= Alpha-Numeric-String VALUE ::= Arbitrary-Data VALUE-SIZE ::= Number DELIMITER ::= ":<tab>" </verb></tscreen> <sect1>List of common SOIF attribute names <label id="List of common SOIF attribute names"> <p> Each Broker can support different attributes, depending on the data it holds. Below we list a set of the most common attributes: <tscreen><verb> Abstract Brief abstract about the object. Author Author(s) of the object. Description Brief description about the object. File-Size Number of bytes in the object. Full-Text Entire contents of the object. Gatherer-Host Host on which the Gatherer ran to extract information from the object. Gatherer-Name Name of the Gatherer that extracted information from the object. (e.g., Full-Text, Selected-Text, or Terse). Gatherer-Port Port number on the Gatherer-Host that serves the Gatherer's information. Gatherer-Version Version number of the Gatherer. Keywords Searchable keywords extracted from the object. Last-Modification-Time The time that the object was last modified. MD5 MD5 16-byte checksum of the object. Refresh-Rate The number of seconds after Update-Time when the summary object is to be re-generated. Defaults to 1 month. Time-to-Live The number of seconds after Update-Time when the summary object is no longer valid. Defaults to 6 months. Title Title of the object. Type The object's type. 
Some example types are: Archive Audio Awk Backup Binary C CHeader Command Compressed CompressedTar Configuration Data Directory DotFile Dvi FAQ FYI Font FormattedText GDBM GNUCompressed GNUCompressedTar HTML Image Internet-Draft MacCompressed Mail Makefile ManPage Object OtherCode PCCompressed Patch Pdf Perl PostScript RCS README RFC RTF SCCS ShellArchive Tar Tcl Tex Text Troff Uuencoded WaisSource Update-Time The time that the summary object was last updated. REQUIRED field, no default. URI Uniform Resource Identifier. URL-References Any URL references present within HTML objects. </verb></tscreen> <sect>Gatherer Examples <label id="Gatherer Examples"> <p> The following examples install into <em>$HARVEST_HOME/gatherers</em> by default (see Section <ref id="Installing the Harvest Software" name="Installing the Harvest Software">). The Harvest distribution contains several examples of how to configure, customize, and run Gatherers. This section will walk you through several example Gatherers. The goal is to give you a sense of what you can do with a Gatherer and how to do it. You needn't work through all of the examples; each is instructive in its own right. To use the Gatherer examples, you need the Harvest binary directory in your path, and <em>HARVEST_HOME</em> defined. For example: <tscreen><verb> % setenv HARVEST_HOME /usr/local/harvest % set path = ($HARVEST_HOME/bin $path) </verb></tscreen> <sect1>Example 1 - A simple Gatherer <label id="Example 1"> <p> This example is a simple Gatherer that uses the default customizations. The only work that the user does to configure this Gatherer is to specify the list of URLs from which to gather (see Section <ref id="The Gatherer" name="The Gatherer">). To run this example, type: <tscreen><verb> % cd $HARVEST_HOME/gatherers/example-1 % ./RunGatherer </verb></tscreen> To view the configuration file for this Gatherer, look at <em>example-1.cf</em>. 
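A Gatherer configuration file has two parts: variable settings, followed by the RootNode and LeafNode URL lists. The sketch below shows the general shape using values mentioned in this example (the Top-Directory path is illustrative, not the literal contents of <em>example-1.cf</em>):

```
#  General shape of a Gatherer configuration file (illustrative values)
Gatherer-Name:  Example Gatherer Number 1
Gatherer-Port:  9111
Top-Directory:  /usr/local/harvest/gatherers/example-1

<RootNodes>
http://harvest.cs.colorado.edu/
</RootNodes>

<LeafNodes>
http://harvest.cs.colorado.edu/~schwartz/IRTF.html
</LeafNodes>
```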
The first few lines are variables that specify some local information about the Gatherer (see Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file">). For example, each content summary will contain the name of the Gatherer (<bf>Gatherer-Name</bf>) that generated it. The configuration file also specifies the port number (<bf>Gatherer-Port</bf>) that will be used to export the indexing information, as well as the directory that contains the Gatherer (<bf>Top-Directory</bf>). Notice that there is one RootNode URL and one LeafNode URL. After the Gatherer has finished, it will start up the Gatherer daemon, which will export the content summaries. To view the content summaries, type:
<tscreen><verb>
% gather localhost 9111 | more
</verb></tscreen>
The following SOIF object should look similar to those that this Gatherer generates.
<tscreen><verb>
@FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
Time-to-Live{7}:        9676800
Last-Modification-Time{1}:      0
Refresh-Rate{7}:        2419200
Gatherer-Name{25}:      Example Gatherer Number 1
Gatherer-Host{22}:      powell.cs.colorado.edu
Gatherer-Version{3}:    0.4
Update-Time{9}:         781478043
Type{4}:        HTML
File-Size{4}:   2099
MD5{32}:        c2fa35fd44a47634f39086652e879170
Partial-Text{151}:      research problems Mic Bowman Peter Danzig
Udi Manber Michael Schwartz Darren Hardy talk talk Harvest talk
Advanced Research Projects Agency
URL-References{628}:    ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
http://excalibur.usc.edu/people/danzig.html
http://glimpse.cs.arizona.edu:1994/udi.html
http://harvest.cs.colorado.edu/~schwartz/Home.html
http://harvest.cs.colorado.edu/~hardy/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
http://harvest.cs.colorado.edu/harvest/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html
Title{84}:      IRTF Research Group on Resource Discovery
IRTF Research Group on Resource Discovery
Keywords{121}:  advanced agency bowman danzig darren hardy harvest
manber mic michael peter problems projects research schwartz talk udi
}
</verb></tscreen>
Notice that although the Gatherer configuration file lists only 2 URLs (one in the RootNode section and one in the LeafNode section), there are more than 2 content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode <em>http://harvest.cs.colorado.edu/</em>. Then, for each LeafNode, the Gatherer generated a content summary, as in the example above for <em>http://harvest.cs.colorado.edu/~schwartz/IRTF.html</em>. The HTML summarizer will extract structured information about the Author and Title of the file. It will also extract any URL links into the <em>URL-References</em> attribute, and any anchor tags into the <em>Partial-Text</em> attribute. Other information about the HTML file, such as its MD5 (see <htmlurl url="http://www.ietf.org/rfc/rfc1321.txt" name="RFC1321">) and its size in bytes (<em>File-Size</em>), is also added to the content summary. <sect1>Example 2 - Incorporating manually generated information <label id="Example 2"> <p> The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually-generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Section <ref id="The Summary Object Interchange Format (SOIF)" name="The Summary Object Interchange Format (SOIF)">). This example demonstrates an exploder for the Linux Software Map (LSM) format.
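The mechanics behind an exploder are simple: read one input file and emit one SOIF object per resource it describes. The following Python sketch illustrates only the SOIF output convention from the grammar section; it is a hypothetical illustration, not the <tt>LSM.unnest</tt> program shipped with Harvest, and the <tt>soif_attribute</tt> and <tt>soif_object</tt> names, the example.org URL, and the attribute values are all invented:
<tscreen><verb>
#!/usr/bin/env python
# Hypothetical sketch of the output side of an exploder: wrap a set
# of (name, value) pairs as one SOIF object.  Each value carries an
# explicit size count, per the grammar:  Identifier{size}:<tab>value
# (len() counts characters; plain ASCII values are assumed here).

def soif_attribute(name, value):
    # Render one ATTRIBUTE: IDENTIFIER {VALUE-SIZE} DELIMITER VALUE
    return "%s{%d}:\t%s" % (name, len(value), value)

def soif_object(url, template_type, pairs):
    # Render one OBJECT: @ TEMPLATE-TYPE { URL ATTRIBUTE-LIST }
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in pairs:
        lines.append(soif_attribute(name, value))
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    # A real exploder would loop over the records found in its input
    # file, printing one object per record.
    print(soif_object("ftp://example.org/pkg-1.0.tar.gz", "FILE",
                      [("Title", "Example package"),
                       ("Version", "1.0")]))
</verb></tscreen>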
LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system. To run this example, type:
<tscreen><verb>
% cd $HARVEST_HOME/gatherers/example-2
% ./RunGatherer
</verb></tscreen>
To view the configuration file for this Gatherer, look at <em>example-2.cf</em>. Notice that the Gatherer has its own <em>Lib-Directory</em> (see Section <ref id="Setting variables in the Gatherer configuration file" name="Setting variables in the Gatherer configuration file"> for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence. In this example, we've only customized the candidate selection step. <em>lib/stoplist.cf</em> defines the types that Essence should not index. This example uses an empty <em>stoplist.cf</em> file to direct Essence to index all files. The Gatherer retrieves each of the LeafNode URLs, which are all Linux Software Map files from the Linux FTP archive <em>tsx-11.mit.edu</em>. The Gatherer recognizes that a ``.lsm'' file is of type <em>LSM</em> because of the naming heuristic present in <em>lib/byname.cf</em>. The <em>LSM</em> type is a ``nested'' type as specified in the Essence source code (<em>src/gatherer/essence/unnest.c</em>). Exploder programs (named <tt>TypeName.unnest</tt>) are run on nested types rather than the usual summarizers. The <tt>LSM.unnest</tt> program is the standard exploder program that takes an <em>LSM</em> file and generates one or more corresponding SOIF objects. When the Gatherer finishes, its database contains SOIF objects for the software described within each <em>LSM</em> file. After the Gatherer has finished, it will start up the Gatherer daemon, which will serve the content summaries.
To view the content summaries, type:
<tscreen><verb>
% gather localhost 9222 | more
</verb></tscreen>
Because <em>tsx-11.mit.edu</em> is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in <em>log.errors</em> and <em>log.gatherer</em> to try to determine the problem. The following two SOIF objects were generated by this Gatherer. The first object summarizes the <em>LSM</em> file itself, and the second object summarizes the software described in the <em>LSM</em> file.
<tscreen><verb>
@FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
Time-to-Live{7}:        9676800
Last-Modification-Time{9}:      781931042
Refresh-Rate{7}:        2419200
Gatherer-Name{25}:      Example Gatherer Number 2
Gatherer-Host{22}:      powell.cs.colorado.edu
Gatherer-Version{3}:    0.4
Type{3}:        LSM
Update-Time{9}:         781931042
File-Size{3}:   848
MD5{32}:        67377f3ea214ab680892c82906081caf
}

@FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
Time-to-Live{7}:        9676800
Last-Modification-Time{9}:      781931042
Refresh-Rate{7}:        2419200
Gatherer-Name{25}:      Example Gatherer Number 2
Gatherer-Host{22}:      powell.cs.colorado.edu
Gatherer-Version{3}:    0.4
Update-Time{9}:         781931042
Type{16}:       GNUCompressedTar
Title{48}:      Section 2, 3, 4, 5, 7, and 9 man pages for Linux
Version{3}:     1.4
Description{124}:       Man pages for Linux.  Mostly section 2 is
complete.  Section 3 has over 200 man pages, but it still far from
being finished.
Author{27}:     Linux Documentation Project
AuthorEmail{11}:        DOC channel
Maintainer{9}:  Rik Faith
MaintEmail{16}:         faith@cs.unc.edu
Site{45}:       ftp.cs.unc.edu sunsite.unc.edu tsx-11.mit.edu
Path{94}:       /pub/faith/linux /pub/Linux/docs/linux-doc-project/man-pages /pub/linux/docs/linux-doc-project
File{20}:       man-pages-1.4.tar.gz
FileSize{4}:    170k
CopyPolicy{47}:         Public Domain or otherwise freely distributable
Keywords{10}:   man pages
Entered{24}:    Sun Sep 11 19:52:06 1994
EnteredBy{9}:   Rik Faith
CheckedEmail{16}:       faith@cs.unc.edu
}
</verb></tscreen>
We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP. <sect1>Example 3 - Customizing type recognition and candidate selection <label id="Example 3"> <p> This example demonstrates how to customize the type recognition and candidate selection steps in the Gatherer (see Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">). This Gatherer recognizes World Wide Web home pages, and is configured to collect indexing information only from these home pages. To run this example, type:
<tscreen><verb>
% cd $HARVEST_HOME/gatherers/example-3
% ./RunGatherer
</verb></tscreen>
To view the configuration file for this Gatherer, look at <em>example-3.cf</em>. As in Section <ref id="Example 2" name="Example 2">, this Gatherer has its own library directory that contains a customization for Essence. Since we're only interested in indexing home pages, we need only define the heuristics for recognizing home pages. As shown below, we can use URL naming heuristics to define a home page in <em>lib/byurl.cf</em>.
We've also added a default <em>Unknown</em> type to make candidate selection easier in this file.
<tscreen><verb>
HomeHTML        ^http:.*/$
HomeHTML        ^http:.*[hH]ome\.html$
HomeHTML        ^http:.*[hH]ome[pP]age\.html$
HomeHTML        ^http:.*[wW]elcome\.html$
HomeHTML        ^http:.*/index\.html$
</verb></tscreen>
The <em>lib/stoplist.cf</em> configuration file contains a list of types not to index. In this example, <em>Unknown</em> is the only type name listed in <em>stoplist.cf</em>, so the Gatherer will only reject files of the <em>Unknown</em> type. You can also recognize URLs by their filename (in <em>byname.cf</em>) or by their content (in <em>bycontent.cf</em> and <em>magic</em>), although in this example we don't need to use those mechanisms. The default <tt>HomeHTML.sum</tt> summarizer summarizes each <em>HomeHTML</em> file. After the Gatherer has finished, it will start up the Gatherer daemon, which will serve the content summaries. You'll notice that only content summaries for HomeHTML files are present. To view the content summaries, type:
<tscreen><verb>
% gather localhost 9333 | more
</verb></tscreen>
<sect1>Example 4 - Customizing type recognition and summarizing <label id="Example 4"> <p> This example demonstrates how to customize the type recognition and summarizing steps in the Gatherer (see Section <ref id="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps" name="Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps">). This Gatherer recognizes two new file formats and summarizes them appropriately. To view the configuration file for this Gatherer, look at <em>example-4.cf</em>. As in Sections <ref id="Example 2" name="Example 2"> and <ref id="Example 3" name="Example 3">, this Gatherer has its own library directory that contains the configuration files for Essence.
The Essence configuration files are the same as the default customization, except for <em>lib/byname.cf</em>, which contains two customizations for the new file formats. <sect2>Using regular expressions to summarize a format <p> The first new format is the ``ReferBibliographic'' type, which is the format that the <tt>refer</tt> program uses to represent bibliography information. To recognize that a file is in this format, we'll use the convention that the filename ends in ``.referbib''. So, we add that naming heuristic as a type recognition customization. Naming heuristics are represented as a regular expression against the filename in the <em>lib/byname.cf</em> file:
<tscreen><verb>
ReferBibliographic      ^.*\.referbib$
</verb></tscreen>
Now, to write a summarizer for this type, we'll need a sample ReferBibliographic file:
<tscreen><verb>
%A A. S. Tanenbaum
%T Computer Networks
%I Prentice Hall
%C Englewood Cliffs, NJ
%D 1988
</verb></tscreen>
Essence summarizers extract structured information from files. One way to write a summarizer is by using regular expressions to define the extractions. For each type of information that you want to extract from a file, add the regular expression that will match lines in that file to <em>lib/quick-sum.cf</em>.
For example, the following regular expressions in <em>lib/quick-sum.cf</em> will extract the author, title, date, and other information from ReferBibliographic files:
<tscreen><verb>
ReferBibliographic  Author            ^%A[ \t]+.*$
ReferBibliographic  City              ^%C[ \t]+.*$
ReferBibliographic  Date              ^%D[ \t]+.*$
ReferBibliographic  Editor            ^%E[ \t]+.*$
ReferBibliographic  Comments          ^%H[ \t]+.*$
ReferBibliographic  Issuer            ^%I[ \t]+.*$
ReferBibliographic  Journal           ^%J[ \t]+.*$
ReferBibliographic  Keywords          ^%K[ \t]+.*$
ReferBibliographic  Label             ^%L[ \t]+.*$
ReferBibliographic  Number            ^%N[ \t]+.*$
ReferBibliographic  Comments          ^%O[ \t]+.*$
ReferBibliographic  Page-Number       ^%P[ \t]+.*$
ReferBibliographic  Unpublished-Info  ^%R[ \t]+.*$
ReferBibliographic  Series-Title      ^%S[ \t]+.*$
ReferBibliographic  Title             ^%T[ \t]+.*$
ReferBibliographic  Volume            ^%V[ \t]+.*$
ReferBibliographic  Abstract          ^%X[ \t]+.*$
</verb></tscreen>
The first field in <em>lib/quick-sum.cf</em> is the name of the type. The second field is the attribute under which to extract the information on lines that match the regular expression in the third field. <sect2>Using programs to summarize a format <p> The second new file format is the ``Abstract'' type, which is a file that contains only the text of a paper abstract (a format that is common in technical report FTP archives). To recognize that a file is written in this format, we'll use the naming convention that the filename for ``Abstract'' files ends in ``.abs''. So, we add that type recognition customization to the <em>lib/byname.cf</em> file as a regular expression:
<tscreen><verb>
Abstract        ^.*\.abs$
</verb></tscreen>
Another way to write a summarizer is to write a program or script that takes a filename as the first argument on the command line, extracts the structured information, then outputs the results as a list of SOIF attribute-value pairs. Summarizer programs are named <tt>TypeName.sum</tt>, so we call our new summarizer <tt>Abstract.sum</tt>.
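The contract for a summarizer program is simple: read the file named by the first command-line argument and print SOIF attribute-value pairs on standard output. As a hypothetical illustration of that contract (the actual <tt>Abstract.sum</tt> shipped with Harvest is a short Bourne shell script, shown below), the same behavior might be sketched in Python; the <tt>summarize</tt> function name is invented:
<tscreen><verb>
#!/usr/bin/env python
# Hypothetical illustration of a summarizer program: take the first
# 50 lines of the named file and emit them as a single SOIF
# "Abstract" attribute-value pair (Identifier{size}:<tab>value).

import sys

def summarize(filename, max_lines=50):
    with open(filename) as f:
        text = "".join(f.readlines()[:max_lines])
    return "Abstract{%d}:\t%s" % (len(text), text)

if __name__ == "__main__":
    print(summarize(sys.argv[1]))
</verb></tscreen>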
Remember to place the summarizer program in a directory that is in your path so that the Gatherer can run it. You'll see below that <tt>Abstract.sum</tt> is a Bourne shell script that takes the first 50 lines of the file, wraps them as the ``Abstract'' attribute, and outputs the result as a SOIF attribute-value pair.
<tscreen><verb>
#!/bin/sh
#
# Usage: Abstract.sum filename
#
head -50 "$1" | wrapit "Abstract"
</verb></tscreen>
<sect2>Running the example <p> To run this example, type:
<tscreen><verb>
% cd $HARVEST_HOME/gatherers/example-4
% ./RunGatherer
</verb></tscreen>
After the Gatherer has finished, it will start up the Gatherer daemon, which will serve the content summaries. To view the content summaries, type:
<tscreen><verb>
% gather localhost 9444 | more
</verb></tscreen>
<sect1>Example 5 - Using RootNode filters <p> This example demonstrates how to use RootNode filters to customize the candidate selection in the Gatherer (see Section <ref id="RootNode filters" name="RootNode filters">). Only items that pass RootNode filters will be retrieved across the network (see Section <ref id="Gatherer enumeration vs. candidate selection" name="Gatherer enumeration vs. candidate selection">). To run this example, type:
<tscreen><verb>
% cd $HARVEST_HOME/gatherers/example-5
% ./RunGatherer
</verb></tscreen>
After the Gatherer has finished, it will start up the Gatherer daemon, which will serve the content summaries. To view the content summaries, type:
<tscreen><verb>
% gather localhost 9555 | more
</verb></tscreen>
<sect>History of Harvest <p> <sect1>History of Harvest <p>
<itemize>
<item>1996-01-31: Harvest 1.4pl2 was the last official release by Darren R. Hardy, Michael F. Schwartz, and Duane Wessels.
<item>1997-04-21: Harvest 1.5 was released by Simon Wilkinson.
<item>1998-06-12: Harvest 1.5.20 was released by Simon Wilkinson.
<item>1999-05-26: Harvest-MathNet100.tar.gz was released.
<item>2000-01-14: harvest-modified-by-RL-Stajsic.tar.gz was released.
<item>2000-02-07: Harvest 1.6.1 was released by Kang-Jin Lee in cooperation with Simon Wilkinson.
<item>2002-10-25: Harvest 1.8.0 was released by Harald Weinreich and Kang-Jin Lee.
</itemize>
<sect1>History of Harvest User's Manual <p>
<itemize>
<item>1996-01-31: Harvest User's Manual for Harvest 1.4.pl2 was written by Darren R. Hardy, Michael F. Schwartz, and Duane Wessels. The document was written in LaTeX. The HTML version (converted with LaTeX2HTML) and the PostScript version were made available to the public.
<item>2001-04-27: The HTML version of this document was updated and bundled with the Harvest distribution by Kang-Jin Lee. Notable changes were the removal of the sections about the Harvest Object Cache and the Replicator, which are no longer part of Harvest.
<item>2002-01-28: This Harvest User's Manual was converted to linuxdoc. It is now available in PostScript, PDF, text, and HTML formats.
</itemize>
</article>