Harvest User's Manual
Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee
2002-10-29

Harvest User's Manual was edited by Kang-Jin Lee and covers Harvest
version 1.8. It was originally written by Darren R. Hardy, Michael F.
Schwartz and Duane Wessels for Harvest 1.4.pl2 on 1996-01-31.
______________________________________________________________________

Table of Contents

1. Introduction to Harvest
   1.1 Copyright
   1.2 Online Harvest Resources

2. Subsystem Overview
   2.1 Distributing the Gathering and Brokering Processes

3. Installing the Harvest Software
   3.1 Requirements for Harvest Servers
      3.1.1 Hardware
      3.1.2 Platforms
      3.1.3 Software
   3.2 Requirements for Harvest Users
   3.3 Retrieving and Installing the Harvest Software
      3.3.1 Distribution types
      3.3.2 Harvest components
      3.3.3 User-contributed software
   3.4 Building the Source Distribution
   3.5 Additional installation for the Harvest Broker
      3.5.1 Checking the installation for HTTP access
      3.5.2 Required modifications to your HTTP server
      3.5.3 Apache httpd
      3.5.4 Other HTTP servers
   3.6 Upgrading versions of the Harvest software
      3.6.1 Upgrading from version 1.6 to version 1.8
      3.6.2 Upgrading from version 1.5 to version 1.6
      3.6.3 Upgrading from version 1.4 to version 1.5
      3.6.4 Upgrading from version 1.3 to version 1.4
      3.6.5 Upgrading from version 1.2 to version 1.3
      3.6.6 Upgrading from version 1.1 to version 1.2
      3.6.7 Upgrading to version 1.1 from version 1.0 or older
   3.7 Starting up the system: RunHarvest and related commands
   3.8 Harvest team contact information

4. The Gatherer
   4.1 Overview
   4.2 Basic setup
      4.2.1 Gathering News URLs with NNTP
      4.2.2 Cleaning out a Gatherer
   4.3 RootNode specifications
      4.3.1 RootNode filters
      4.3.2 Generic Enumeration program description
      4.3.3 Example RootNode configuration
      4.3.4 Gatherer enumeration vs. candidate selection
   4.4 Generating LeafNode/RootNode URLs from a program
   4.5 Extracting data for indexing: The Essence summarizing subsystem
      4.5.1 Default actions of ``stock'' summarizers
      4.5.2 Summarizing SGML data
         4.5.2.1 Location of support files
         4.5.2.2 The SGML to SOIF table
         4.5.2.3 Errors and warnings from the SGML Parser
         4.5.2.4 Creating a summarizer for a new SGML-tagged data type
         4.5.2.5 The SGML-based HTML summarizer
         4.5.2.6 Adding META data to your HTML
         4.5.2.7 Other examples
      4.5.3 Customizing the type recognition, candidate selection,
            presentation unnesting, and summarizing steps
         4.5.3.1 Customizing the type recognition step
         4.5.3.2 Customizing the candidate selection step
         4.5.3.3 Customizing the presentation unnesting step
         4.5.3.4 Customizing the summarizing step
   4.6 Post-Summarizing: Rule-based tuning of object summaries
      4.6.1 The Rules file
      4.6.2 Rewriting URLs
   4.7 Gatherer administration
      4.7.1 Setting variables in the Gatherer configuration file
      4.7.2 Local file system gathering for reduced CPU load
      4.7.3 Gathering from password-protected servers
      4.7.4 Controlling access to the Gatherer's database
      4.7.5 Periodic gathering and realtime updates
      4.7.6 The local disk cache
      4.7.7 Incorporating manually generated information into a Gatherer
   4.8 Troubleshooting

5. The Broker
   5.1 Overview
   5.2 Basic setup
   5.3 Querying a Broker
      5.3.1 Example queries
      5.3.2 Regular expressions
      5.3.3 Query options selected by menus or buttons
      5.3.4 Filtering query results
      5.3.5 Result set presentation
   5.4 Customizing the Broker's Query Result Set
      5.4.1 The search.cf configuration file
         5.4.1.1 Defined Variables
         5.4.1.2 List of Definitions
      5.4.2 Example search.cf customization file
      5.4.3 Integrating your customized configuration file
      5.4.4 Displaying SOIF attributes in results
   5.5 World Wide Web interface description
      5.5.1 HTML files for graphical user interface
      5.5.2 CGI programs
      5.5.3 Help files for the user
   5.6 Administrating a Broker
      5.6.1 Deleting unwanted Broker objects
      5.6.2 Command-line Administration
   5.7 Tuning Glimpse indexing in the Broker
      5.7.1 The glimpseserver program
   5.8 Using different index/search engines with the Broker
      5.8.1 Using Swish as an indexer
      5.8.2 Using WAIS as an indexer
   5.9 Collector interface description: Collection.conf
   5.10 Troubleshooting

6. Programs and layout of the installed Harvest software
   6.1 $HARVEST_HOME
   6.2 $HARVEST_HOME/bin
   6.3 $HARVEST_HOME/brokers
   6.4 $HARVEST_HOME/cgi-bin
   6.5 $HARVEST_HOME/gatherers
   6.6 $HARVEST_HOME/lib
   6.7 $HARVEST_HOME/lib/broker
   6.8 $HARVEST_HOME/lib/gatherer
   6.9 $HARVEST_HOME/tmp

7. The Summary Object Interchange Format (SOIF)
   7.1 Formal description of SOIF
   7.2 List of common SOIF attribute names

8. Gatherer Examples
   8.1 Example 1 - A simple Gatherer
   8.2 Example 2 - Incorporating manually generated information
   8.3 Example 3 - Customizing type recognition and candidate selection
   8.4 Example 4 - Customizing type recognition and summarizing
      8.4.1 Using regular expressions to summarize a format
      8.4.2 Using programs to summarize a format
      8.4.3 Running the example
   8.5 Example 5 - Using RootNode filters

9. History of Harvest
   9.1 History of Harvest
   9.2 History of Harvest User's Manual
______________________________________________________________________

1.  Introduction to Harvest

Harvest is an integrated set of tools to gather, extract, organize, and
search information across the Internet. With modest effort users can
tailor Harvest to digest information in many different formats, and
offer custom search services on the Internet.

A key goal of Harvest is to provide a flexible system that can be
configured in various ways to create many types of indexes. Harvest
also allows users to extract structured (attribute-value pair)
information from many different information formats and build indexes
that allow these attributes to be referenced during queries (e.g.,
searching for all documents with a certain regular expression in the
title field). An important advantage of Harvest is that it allows users
to build indexes using either manually constructed templates (for
maximum control over index content) or automatically extracted data
(for easy coverage of large collections), or using a hybrid of the two
methods. Harvest is designed to make it easy to distribute the search
system on a pool of networked machines to handle higher load.

1.1.  Copyright

The core of Harvest is licensed under the GPL <../../COPYING>.
Additional components distributed with Harvest are also under the GPL
or a similar license.

Glimpse, the current default full-text indexer, has a different
license. A clarification of Glimpse's copyright status
<../glimpse-license-status> was kindly posted by Golda Velez to
comp.infosystems.harvest.

1.2.  Online Harvest Resources

This manual is available at
harvest.sourceforge.net/harvest/doc/html/manual.html.

More information about Harvest is available at harvest.sourceforge.net.

2.  Subsystem Overview

Harvest consists of several subsystems.
The Gatherer subsystem collects indexing information (such as keywords,
author names, and titles) from the resources available at Provider
sites (such as FTP and HTTP servers). The Broker subsystem retrieves
indexing information from one or more Gatherers, suppresses duplicate
information, incrementally indexes the collected information, and
provides a WWW query interface to it.

[Figure: Harvest Software Components]

You should start using Harvest simply, by installing a single ``stock''
(i.e., not customized) Gatherer and Broker on one machine to index some
of the FTP, World Wide Web, and NetNews data at your site.

After you get the system working in this basic configuration, you can
invest additional effort as warranted. First, as you scale up to index
larger volumes of information, you can reduce the CPU and network load
of indexing your data by distributing the gathering process. Second,
you can customize how Harvest extracts, indexes, and searches your
information, to better match the types of data you have and the ways
your users would like to interact with the data.

We discuss how to distribute the gathering process in the next
subsection. We cover various forms of customization in Section
``Customizing the type recognition, candidate selection, presentation
unnesting, and summarizing steps'' and in several parts of Section
``The Broker''.

2.1.  Distributing the Gathering and Brokering Processes

Harvest Gatherers and Brokers can be configured in various ways.
Running a Gatherer remotely from a Provider site allows Harvest to
interoperate with sites that are not running Harvest Gatherers, by
using standard object retrieval protocols like FTP, Gopher, HTTP, and
NNTP. However, as suggested by the bold lines in the left side of
Figure 2, this arrangement results in excess server and network load.
Running a Gatherer locally is much more efficient, as shown in the
right side of Figure 2.
Nonetheless, running a Gatherer remotely is still better than having
many sites independently collect indexing information, since many
Brokers or other search services can share the indexing information
that the Gatherer collects.

If you have a number of FTP/HTTP/Gopher/NNTP servers at your site, it
is most efficient to run a Gatherer on each machine where these servers
run. On the other hand, you can reduce installation effort by running a
Gatherer on just one machine at your site and letting it retrieve data
from across the network.

[Figure 2: Harvest Configuration Options]

Figure 2 also illustrates that a Broker can collect information from
many Gatherers (to build an index of widely distributed information).
Brokers can also retrieve information from other Brokers, in effect
cascading indexed views from one another. Brokers retrieve this
information using the query interface, allowing them to filter or
refine the information from one Broker to the next.

3.  Installing the Harvest Software

3.1.  Requirements for Harvest Servers

3.1.1.  Hardware

A good machine for running a typical Harvest server will have a
reasonably fast processor, 1-2 GB of free disk space, and 128 MB of
RAM. A slower CPU will work but it will slow down the Harvest server.
More important than CPU speed, however, is memory size. Harvest uses a
number of processes, some of which provide needed ``plumbing'' (e.g.,
search.cgi), and some of which improve performance (e.g., the
glimpseserver process). If you do not have enough memory, your system
will page too much, drastically reducing performance. The other factor
affecting RAM usage is how much data you are trying to index in a
Harvest Broker. The more data you index, the more disk I/O is performed
at query time, and the more RAM is needed to provide a reasonably sized
disk buffer pool.

The amount of disk you'll need depends on how much data you want to
index in a single Broker.
(It is possible to distribute your index over multiple Brokers if it
gets too large for one disk.) A good rule of thumb is that you will
need about 10% as much disk to hold the Gatherer and Broker databases
as the total size of the data you want to index. The actual space needs
will vary depending on the type of data you are indexing. For example,
PostScript achieves a much higher indexing space reduction than HTML,
because so much of the PostScript data (such as page positioning
information) is discarded when building the index.

3.1.2.  Platforms

To run a Harvest server, you need a UNIX-like operating system.

3.1.3.  Software

To use Harvest, you need the following software packages:

o  All Harvest servers require: Perl v5.0 or higher.

o  The Harvest Broker and Gatherer require: GNU gzip v1.2.4 or higher.

o  The Harvest Broker requires: an HTTP server.

To build Harvest from the source distribution you may need to install
one or more of the following software packages:

o  Compiling Harvest requires: GNU gcc v2.5.8 or higher.

o  Compiling the Harvest Broker requires: flex v2.4.7 or higher and
   bison v1.22 or higher.

The sources for gcc, gzip, flex, and bison are available at the GNU FTP
server.

3.2.  Requirements for Harvest Users

Anyone with a web browser (e.g., Internet Explorer, Lynx, Mozilla,
Netscape, Opera, etc.) can access and use Harvest servers.

3.3.  Retrieving and Installing the Harvest Software

3.3.1.  Distribution types

Currently we offer only a source distribution of Harvest. The source
distribution contains all of the source code for the Harvest software.
There are no binary distributions of Harvest.

You can retrieve the Harvest source distributions from the Harvest
download site prdownloads.sourceforge.net/harvest/.

3.3.2.  Harvest components

Harvest components are in the components directory. To use a component,
follow the instructions included in the desired component directory.

3.3.3.  User-contributed software

There is a collection of unsupported user-contributed software in the
contrib directory. If you would like to contribute some software,
please send email to lee@arco.de.

3.4.  Building the Source Distribution

The source distribution can be extracted in any directory. The
following command will extract the gzip-compressed source archive:

     % gzip -dc harvest-x.y.z.tar.gz | tar xf -

For archives compressed with bzip2, use:

     % bzip2 -dc harvest-x.y.z.tar.bz2 | tar xf -

Harvest uses GNU's autoconf package to perform needed configuration at
installation time. If you want to override the default installation
location of /usr/local/harvest, change the ``prefix'' variable when
invoking ``configure''. If desired, you may edit
src/common/include/config.h before compiling to change various Harvest
compile-time limits and variables.

To compile the source tree, type make. For example, to build and
install the entire Harvest system into the /usr/local/harvest
directory, type:

     % ./configure
     % make
     % make install

You may see some compiler warning messages, which you can ignore.

Building the entire Harvest distribution will take a few minutes on a
reasonably fast machine. The compiled source tree takes approximately
25 megabytes of disk space.

Later, after the installed software is working, you can remove the
compiled code (``.o'' files) and other intermediate files by typing
make clean. If you want to remove the configure-generated Makefiles,
type make distclean.

3.5.  Additional installation for the Harvest Broker

3.5.1.  Checking the installation for HTTP access

The Broker interacts with your HTTP server in a number of ways. You
should make sure that the HTTP server can properly access the files it
needs. In many cases, the HTTP server will run under a different userid
than the owner of the Harvest files. First, make sure the HTTP server
userid can read the query.html files in each broker directory. Second,
make sure the HTTP server userid can access and execute the CGI
programs in $HARVEST_HOME/cgi-bin/. The search.cgi script reads files
from the $HARVEST_HOME/cgi-bin/lib/ directory, so check that as well.
Finally, check the files in $HARVEST_HOME/lib/. Some of the CGI Perl
scripts require ``include'' files in this directory.

3.5.2.  Required modifications to your HTTP server

The Harvest Broker requires that an HTTP server is running, and that
the HTTP server ``knows'' about the Broker's files. Below are some
examples of how to configure various HTTP servers to work with the
Harvest Broker.

3.5.3.  Apache httpd

Requires a ScriptAlias and an Alias entry in httpd.conf, e.g.:

     ScriptAlias /Harvest/cgi-bin/ Your-HARVEST_HOME/cgi-bin/
     Alias /Harvest/ Your-HARVEST_HOME/

WARNING: The ScriptAlias entry must appear before the Alias entry.

Additionally, it might be necessary to configure Apache httpd to follow
symbolic links. To do this, add the following to your httpd.conf:

     Options FollowSymLinks

3.5.4.  Other HTTP servers

Install the HTTP server and modify its configuration file so that the
/Harvest directory points to $HARVEST_HOME. You will also need to
configure your HTTP server so that it knows that the directory
/Harvest/cgi-bin contains valid CGI programs.
If the default behaviour of your HTTP server is not to follow symbolic
links, you will need to configure it so that it will follow symbolic
links in the /Harvest directory.

3.6.  Upgrading versions of the Harvest software

3.6.1.  Upgrading from version 1.6 to version 1.8

You cannot install version 1.8 on top of version 1.6. The change from
version 1.6 to version 1.8 included some reorganization of the
executables, so simply installing version 1.8 on top of version 1.6
would cause you to use old executables in some cases.

To upgrade from Harvest version 1.6 to 1.8, do:

1. Move your old installation to a temporary location.

2. Install the new version as directed by the release notes.

3. Then, for each Gatherer and Broker that you were running under the
   old installation, migrate the server into the new installation.

   Gatherers:
      You need to move the Gatherer's directory into
      $HARVEST_HOME/gatherers. Section ``RootNode specifications''
      describes the Gatherer workload specifications if you want to
      modify your Gatherer's configuration file.

   Brokers:
      Rebuild your broker by using CreateBroker and merge in any
      customizations you have made to your old Broker.

3.6.2.  Upgrading from version 1.5 to version 1.6

There are no known incompatibilities between versions 1.5 and 1.6.

3.6.3.  Upgrading from version 1.4 to version 1.5

You cannot install version 1.5 on top of version 1.4. The change from
version 1.4 to version 1.5 included some reorganization of the
executables, so simply installing version 1.5 on top of version 1.4
would cause you to use old executables in some cases.

To upgrade from Harvest version 1.4 to 1.5, do:

1. Move your old installation to a temporary location.

2. Install the new version as directed by the release notes.

3. Then, for each Gatherer and Broker that you were running under the
   old installation, migrate the server into the new installation.

   Gatherers:
      You need to move the Gatherer's directory into
      $HARVEST_HOME/gatherers. Section ``RootNode specifications''
      describes the Gatherer workload specifications if you want to
      modify your Gatherer's configuration file.

   Brokers:
      You need to move the Broker's directory into
      $HARVEST_HOME/brokers. Remove any .glimpse_* files from your
      Broker's directory and use the admin.html interface to force a
      full-index. You may want, however, to rebuild your broker by
      using CreateBroker so that you can use the updated query.html and
      related files.

3.6.4.  Upgrading from version 1.3 to version 1.4

There are no known incompatibilities between versions 1.3 and 1.4.

3.6.5.  Upgrading from version 1.2 to version 1.3

Version 1.3 is mostly backwards compatible with 1.2, with the following
exception: Harvest 1.3 uses Glimpse 3.0. The .glimpse_* files in the
broker directory created with Harvest 1.2 (Glimpse 2.0) are
incompatible. After installing Harvest 1.3 you should:

1. Shut down any running brokers.

2. Execute rm .glimpse_* in each broker directory.

3. Restart your brokers with RunBroker.

4. Force a full-index from the admin.html interface.

3.6.6.  Upgrading from version 1.1 to version 1.2

There are a few incompatibilities between Harvest version 1.1 and
version 1.2.

o  The Gatherer has improved incremental gathering support which is
   incompatible with version 1.1. To update your existing Gatherer,
   change into the Gatherer's Data-Directory (usually the data
   subdirectory), and run the following commands:

     % set path = ($HARVEST_HOME/lib/gatherer $path)
     % cd data
     % rm -f INDEX.gdbm
     % mkindex

   This should create the INDEX.gdbm and MD5.gdbm files in the current
   directory.

o  The Broker has a new log format for the admin/LOG file which is
   incompatible with version 1.1.

3.6.7.  Upgrading to version 1.1 from version 1.0 or older

If you already have an older version of Harvest installed, and want to
upgrade, you cannot unpack the new distribution on top of the old one.
The change from version 1.0 to version 1.1 included some reorganization
of the executables, so simply installing version 1.1 on top of version
1.0 would cause you to use old executables in some cases.

On the other hand, you may not want to start over from scratch with a
new software version, as that would not take advantage of the data you
have already gathered and indexed. Instead, to upgrade from Harvest
version 1.0 to 1.1, do the following:

1. Move your old installation to a temporary location.

2. Install the new version as directed by the release notes.

3. Then, for each Gatherer and Broker that you were running under the
   old installation, migrate the server into the new installation.

   Gatherers:
      You need to move the Gatherer's directory into
      $HARVEST_HOME/gatherers. Section ``RootNode specifications''
      describes the new Gatherer workload specifications which were
      introduced in version 1.1; you may modify your Gatherer's
      configuration file to employ this new functionality.

   Brokers:
      You need to move the Broker's directory into
      $HARVEST_HOME/brokers. You may want, however, to rebuild your
      broker by using CreateBroker so that you can use the updated
      query.html and related files.
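
The directory move in step 3 of the upgrade procedures above can be
sketched in Python. This is only an illustration under stated
assumptions: the function name migrate_servers and its parameters are
hypothetical and not part of Harvest, and the sketch covers only moving
directories. Rebuilding a Broker with CreateBroker, removing .glimpse_*
files, or forcing a full-index still has to be done as described in the
relevant subsection.

```python
import shutil
from pathlib import Path


def migrate_servers(server_dirs, new_home, kind):
    """Move old Gatherer or Broker directories into a new installation.

    server_dirs: directories from the old (moved-aside) installation.
    new_home:    the new $HARVEST_HOME (hypothetical parameter name).
    kind:        "gatherers" or "brokers", the target subdirectory.
    Returns the list of new locations.
    """
    if kind not in ("gatherers", "brokers"):
        raise ValueError("kind must be 'gatherers' or 'brokers'")

    dest_parent = Path(new_home) / kind
    dest_parent.mkdir(parents=True, exist_ok=True)

    moved = []
    for src in map(Path, server_dirs):
        dest = dest_parent / src.name
        # Refuse to overwrite a server that already exists in the new tree.
        if dest.exists():
            raise FileExistsError(f"{dest} already exists; not overwriting")
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

For example, migrate_servers(["/tmp/harvest-old/gatherers/sample"],
"/usr/local/harvest", "gatherers") would move the sample Gatherer to
/usr/local/harvest/gatherers/sample; the /tmp/harvest-old path is an
assumed temporary location, not one prescribed by Harvest.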
3.7.  Starting up the system: RunHarvest and related commands

The simplest way to start the Harvest system is to use the RunHarvest
command. RunHarvest prompts the user with a short list of questions
about what data to index, etc., and then creates and runs a Gatherer
and Broker with a ``stock'' (non-customized) set of content extraction
and indexing mechanisms. Some more primitive commands are also
available, for starting individual Gatherers and Brokers (e.g., if you
want to distribute the gathering process). The Harvest startup commands
are:

RunHarvest
   Checks that the Harvest software is installed correctly, prompts the
   user for basic configuration information, and then creates and runs
   a Gatherer and a Broker. If you have $HARVEST_HOME set, then it will
   use it; otherwise, it tries to determine $HARVEST_HOME
   automatically. Found in the $HARVEST_HOME directory.

RunBroker
   Runs a Broker. Found in the Broker's directory.

RunGatherer
   Runs a Gatherer. Found in the Gatherer's directory.

CreateBroker
   Creates a single Broker which will collect its information from
   other existing Brokers or Gatherers. Used by RunHarvest, or can be
   run by a user to create a new Broker. Uses $HARVEST_HOME, and
   defaults to /usr/local/harvest. Found in the $HARVEST_HOME/bin
   directory.

There is no CreateGatherer command, but the RunHarvest command can
create a Gatherer, or you can create a Gatherer manually (see Section
``Customizing the type recognition, candidate selection, presentation
unnesting, and summarizing steps'' or Section ``Gatherer Examples'').
The layout of the installed Harvest directories and programs is
discussed in Section ``Programs and layout of the installed Harvest
software''.
Among other things, the RunHarvest command asks the user what port
numbers to use when running the Gatherer and the Broker. By default,
the Gatherer will use port 8500 and the Broker will use the Gatherer
port plus 1. The choice of port numbers depends on your particular
machine; you need to choose ports that are not in use by other servers
on your machine. You might look at your /etc/services file to see what
ports are in use (although this file only lists some servers; other
servers use ports without registering that information anywhere).
Usually the above port numbers will not be in use by other processes.
Probably the easiest thing is simply to try the default port numbers
and see if they work.

The remainder of this manual provides information for users who wish to
customize or otherwise make more sophisticated use of Harvest than what
happens when you install the system and run RunHarvest.

3.8.  Harvest team contact information

If you have questions about the Harvest system or problems with the
software, post a note to the USENET newsgroup comp.infosystems.harvest.
Please note your machine type, operating system type, and Harvest
version number in your correspondence.

If you have bug fixes, ports to new platforms or other software
improvements, please email them to the Harvest maintainer lee@arco.de.

4.  The Gatherer

4.1.  Overview

The Gatherer retrieves information resources using a variety of
standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and
then summarizes those resources in various type-specific ways to
generate structured indexing information. For example, a Gatherer can
retrieve a technical report from an FTP archive, and then extract the
author, title, and abstract from the paper to summarize the technical
report.
Harvest Brokers or other search services can then retrieve the indexing
information from the Gatherer to use in a searchable index available
via a WWW interface.

The Gatherer consists of a number of separate components. The Gatherer
program reads a Gatherer configuration file and controls the overall
process of enumerating and summarizing data objects. The structured
indexing information that the Gatherer collects is represented as a
list of attribute-value pairs using the Summary Object Interchange
Format (SOIF, see Section ``The Summary Object Interchange Format
(SOIF)''). The gatherd daemon serves the Gatherer database to Brokers.
It remains running in the background after a gathering session is
complete. A stand-alone gather program is a client for the gatherd
server. It can be used from the command line for testing, and is used
by the Broker. The Gatherer uses a local disk cache to store objects it
has retrieved. The disk cache is described in Section ``The local disk
cache''.

Even though the gatherd daemon remains in the background, a Gatherer
does not automatically update or refresh its summary objects. Each
object in a Gatherer has a Time-to-Live value. Objects remain in the
database until they expire. See Section ``Periodic gathering and
realtime updates'' for more information on keeping Gatherer objects up
to date.

Several example Gatherers are provided with the Harvest software
distribution (see Section ``Gatherer Examples'').

4.2.  Basic setup

To run a basic Gatherer, you need only list the Uniform Resource
Locators (URLs, see RFC1630 and RFC1738) from which it will gather
indexing information. This list is specified in the Gatherer
configuration file, along with other optional information such as the
Gatherer's name and the directory in which it resides (see Section
``Setting variables in the Gatherer configuration file'' for details on
the optional information).
Below is an example Gatherer configuration file:

     #
     #  sample.cf - Sample Gatherer Configuration File
     #
     Gatherer-Name:  My Sample Harvest Gatherer
     Gatherer-Port:  8500
     Top-Directory:  /usr/local/harvest/gatherers/sample

     # Enter URLs for RootNodes here
     http://www.mozilla.org/
     http://www.xfree86.org/

     # Enter URLs for LeafNodes here
     http://www.arco.de/~kj/index.html

As shown in the example configuration file, you may classify a URL as a
RootNode or a LeafNode. For a LeafNode URL, the Gatherer simply
retrieves the URL and processes it. LeafNode URLs are typically files
like PostScript papers or compressed ``tar'' distributions. For a
RootNode URL, the Gatherer will expand it into zero or more LeafNode
URLs by recursively enumerating it in an access method-specific way.
For FTP or Gopher, the Gatherer will perform a recursive directory
listing on the FTP or Gopher server to expand the RootNode (typically a
directory name). For HTTP, a RootNode URL is expanded by following the
embedded HTML links to other URLs. For News, the enumeration returns
all the messages in the specified USENET newsgroup.

PLEASE BE CAREFUL when specifying RootNodes as it is possible to
specify an enormous amount of work with a single RootNode URL. To help
prevent a misconfigured Gatherer from abusing servers or running
wildly, by default the Gatherer will only expand a RootNode into 250
LeafNodes, and will only include HTML links that point to documents
that reside on the same server as the original RootNode URL. There are
several options that allow you to change these limits and otherwise
enhance the Gatherer specification. See Section ``RootNode
specifications'' for details.

The Gatherer is a ``robot'' and collects URLs starting from the URLs
specified in RootNodes. It obeys the robots.txt convention and the
robots META tag. It is also HTTP Version 1.1 compliant and sends the
User-Agent and From request fields to HTTP servers for accountability.
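
The default limits on HTTP RootNode expansion described above (at most
250 LeafNodes, links followed only on the RootNode's own server) can be
illustrated with a small Python sketch. It is not the Gatherer's actual
implementation. The function name enumerate_rootnode, the in-memory
links table (a stand-in for fetching pages and extracting their HTML
links), the breadth-first visiting order, the comparison of host names
rather than IP addresses, and the choice to count the RootNode URL
itself as a LeafNode are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


def enumerate_rootnode(root_url, links, url_max=250, same_host_only=True):
    """Expand a RootNode URL into a list of LeafNode URLs.

    links is a hypothetical dict mapping a URL to the URLs it links to;
    a real robot would fetch each page and parse its HTML instead.
    """
    root_host = urlsplit(root_url).netloc
    seen = {root_url}
    queue = deque([root_url])
    leaf_nodes = []

    # Stop as soon as url_max LeafNodes have been generated.
    while queue and len(leaf_nodes) < url_max:
        url = queue.popleft()
        leaf_nodes.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue  # never enumerate the same URL twice
            if same_host_only and urlsplit(target).netloc != root_host:
                continue  # default: do not leave the RootNode's server
            seen.add(target)
            queue.append(target)
    return leaf_nodes


# A made-up link graph for demonstration only.
example_links = {
    "http://www.example.org/": [
        "http://www.example.org/a.html",
        "http://other.example.net/x.html",
    ],
    "http://www.example.org/a.html": [
        "http://www.example.org/b.html",
        "http://www.example.org/",
    ],
}

if __name__ == "__main__":
    for leaf in enumerate_rootnode("http://www.example.org/", example_links):
        print(leaf)
```

With the made-up graph, the link to other.example.net is skipped because
it leaves the RootNode's server, and lowering url_max truncates the
result, which mirrors how a single RootNode is kept from generating an
unbounded amount of work.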
After you have written the Gatherer configuration file, create a
directory for the Gatherer and copy the configuration file there. Then,
run the Gatherer program with the configuration file as the only
command-line argument, as shown below:

     % Gatherer GathName.cf

The Gatherer will generate a database of the content summaries, a log
file (log.gatherer), and an error log file (log.errors). It will also
start the gatherd daemon which exports the indexing information
automatically to Brokers and other clients. To view the exported
indexing information, you can use the gather client program, as shown
below:

     % gather localhost 8500 | more

The -info option causes the Gatherer to respond only with the Gatherer
summary information, which consists of the attributes available in the
specified Gatherer's database, the Gatherer's host and name, the range
of object update times, and the number of objects. Compression is the
default, but can be disabled with the -nocompress option. The optional
timestamp tells the Gatherer to send only the objects that have changed
since the specified timestamp (in seconds since the UNIX ``epoch'' of
January 1, 1970).

4.2.1.  Gathering News URLs with NNTP

News URLs are somewhat different from the other access protocols
because the URL generally does not contain a hostname. The Gatherer
retrieves News URLs from an NNTP server. The name of this server must
be placed in the environment variable $NNTPSERVER. It is probably a
good idea to add this to your RunGatherer script. If the environment
variable is not set, the Gatherer attempts to connect to a host named
news at your site.

4.2.2.  Cleaning out a Gatherer

Remember that the Gatherer databases persist between runs. Objects
remain in the databases until they expire. When experimenting with the
Gatherer, it is always a good idea to ``clean out'' the databases
between runs.
This is most easily accomplished by executing this command from the
Gatherer directory:

        % rm -rf data tmp log.*

4.3. RootNode specifications

The RootNode specification facility described in Section ``Basic
setup'' provides a basic set of default enumeration actions for
RootNodes. Often it is useful to enumerate beyond the default limits,
for example, to increase the enumeration limit beyond 250 URLs, or to
allow site boundaries to be crossed when enumerating HTML links. It is
possible to specify these and other aspects of enumeration, using the
following syntax:

        <RootNodes>
        URL EnumSpec
        URL EnumSpec
        ...
        </RootNodes>

where EnumSpec is on a single line (using ``\'' to escape linefeeds),
with the following syntax:

        URL=URL-Max[,URL-Filter-filename]  \
        Host=Host-Max[,Host-Filter-filename]  \
        Access=TypeList  \
        Delay=Seconds  \
        Depth=Number  \
        Enumeration=Enumeration-Program

The EnumSpec modifiers are all optional, and have the following
meanings:

URL-Max
        The number specified on the right hand side of the ``URL=''
        expression gives the maximum number of LeafNode URLs to
        generate at all levels of depth, from the current URL. Note
        that URL-Max is the maximum number of URLs that are generated
        during the enumeration, and not a limit on how many URLs can
        pass through the candidate selection phase (see Section
        ``Customizing the candidate selection step'').

URL-Filter-filename
        This is the name of a file containing a set of regular
        expression filters (see Section ``RootNode filters'') to allow
        or deny particular LeafNodes in the enumeration. The default
        filter is $HARVEST_HOME/lib/gatherer/URL-filter-default, which
        excludes many image and sound files.

Host-Max
        The number specified on the right hand side of the ``Host=''
        expression gives the maximum number of hosts that will be
        touched during the RootNode enumeration.
        This enumeration actually counts hosts by IP address so that
        aliased hosts are properly enumerated. Note that this does not
        work correctly for multi-homed hosts, or for hosts with
        rotating DNS entries (used by some sites for load balancing
        heavily accessed servers).

        Note: Prior to Harvest Version 1.2 the ``Host=...'' line was
        called ``Site=...''. We changed the name to ``Host='' because
        it is more intuitively meaningful (being a host count limit,
        not a site count limit). For backwards compatibility with
        older Gatherer configuration files, we will continue to treat
        ``Site='' as an alias for ``Host=''.

Host-Filter-filename
        This is the name of a file containing a set of regular
        expression filters to allow or deny particular hosts in the
        enumeration. Each expression can specify both a host name (or
        IP address) and a port number (in case you have multiple
        servers running on different ports of the same host and you
        want to index only one). The syntax is ``hostname:port''.

Access
        If the RootNode is an HTTP URL, then you can specify the
        access methods across which to enumerate. Valid access method
        types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS. Use
        a ``|'' character between type names to allow multiple access
        methods. For example, ``Access=HTTP|FTP|Gopher'' will follow
        HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode
        URL.

        Note: We do not support cross-method enumeration from Gopher,
        because of the difficulty of ensuring that Gopher pointers do
        not cross site boundaries. For example, the Gopher URL
        gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/
        would get an FTP directory listing of
        ftp.cs.washington.edu:/pub, even though the host part of the
        URL is powell.cs.colorado.edu.

Delay
        This is the number of seconds to wait between server contacts.
        It defaults to one second, when not specified otherwise.
        Delay=3 will let the Gatherer sleep 3 seconds between server
        contacts.

Depth
        This is the maximum number of levels of enumeration that will
        be followed during gathering. Depth=0 means that there is no
        limit to the depth of the enumeration. Depth=1 means the
        specified URL will be retrieved, and all the URLs referenced
        by the specified URL will be retrieved; and so on for higher
        Depth values. In other words, the enumeration will follow
        links up to Depth steps away from the specified URL.

Enumeration-Program
        This modifier adds a very flexible way to control a Gatherer.
        The Enumeration-Program is a filter which reads URLs as input
        and writes new enumeration parameters on output. See Section
        ``Generic Enumeration program description'' for specific
        details.

By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max is 1,
Host-Filter imposes no limit, Access is HTTP only, Delay is 1 second,
and Depth is zero. There is no way to specify an unlimited value for
URL-Max or Host-Max.

4.3.1. RootNode filters

Filter files use the standard UNIX regular expression syntax (as
defined by the POSIX standard), not the csh ``globbing'' syntax. For
example, you would use ``.*abc'' to indicate any string ending with
``abc'', not ``*abc''. A filter file has the following syntax:

        Deny  regex
        Allow regex

The URL-Filter regular expressions are matched only on the URL-path
portion of each URL (the scheme, hostname and port are excluded). For
example, the following URL-Filter file would allow all URLs except
those containing the regular expression ``/gatherers/'':

        Deny  /gatherers/
        Allow .

Another common use of URL-filters is to prevent the Gatherer from
travelling ``up'' a directory.
Automatically generated HTML pages for HTTP and FTP directories often
contain a link for the parent directory ``..''. To keep the Gatherer
below a specific directory, use a URL-filter file such as:

        Allow ^/my/cool/stuff/
        Deny  .

The Host-Filter regular expressions are matched on the
``hostname:port'' portion of each URL. Because the port is included,
you cannot use ``$'' to anchor the end of a hostname. Beginning with
version 1.3, IP addresses may be specified in place of hostnames. A
class B address such as 128.138.0.0 would be written as
``^128\.138\..*'' in regular expression syntax. For example:

        Deny  bcn.boulder.co.us:8080
        Deny  bvsd.k12.co.us
        Allow ^128\.138\..*
        Deny  .

The order of the Allow and Deny entries is important, since the filters
are applied sequentially from first to last. So, for example, if you
list ``Allow .*'' first, no subsequent Deny expressions will be used,
since this Allow filter will allow all entries.

4.3.2. Generic Enumeration program description

Flexible enumeration can be achieved by giving an
Enumeration=Enumeration-Program modifier to a RootNode URL. The
Enumeration-Program is a filter which takes URLs on standard input and
writes new RootNode URLs on standard output.

The output format is different from the one used to specify a RootNode
URL in a Gatherer configuration file. Each output line must have nine
fields separated by spaces. These fields are:

        URL
        URL-Max
        URL-Filter-filename
        Host-Max
        Host-Filter-filename
        Access
        Delay
        Depth
        Enumeration-Program

These are the same fields as described in Section ``RootNode
specifications''. Values must be given for each field. Use /dev/null to
disable the URL-Filter-filename and Host-Filter-filename. Use
/bin/false to disable the Enumeration-Program.

4.3.3. Example RootNode configuration

Below is an example RootNode configuration:

        <RootNodes>
        (1) http://harvest.cs.colorado.edu/ URL=100,MyFilter
        (2) http://www.cs.colorado.edu/ Host=50 Delay=60
        (3) gopher://gopher.colorado.edu/ Depth=1
        (4) file://powell.cs.colorado.edu/home/hardy/ Depth=2
        (5) ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
        (6) http://harvest.cs.colorado.edu/~hardy/hotlist.html \
                Depth=1 Delay=60
        (7) http://harvest.cs.colorado.edu/~hardy/ \
                Depth=2 Access=HTTP|FTP
        </RootNodes>

Each of the above RootNodes follows a different enumeration
configuration as follows:

1. This RootNode will gather up to 100 documents that pass through the
   URL name filters contained within the file MyFilter.

2. This RootNode will gather the documents from up to the first 50
   hosts it encounters while enumerating the specified URL, with no
   limit on the Depth of link enumeration. It will also wait 60
   seconds between retrievals.

3. This RootNode will gather only the documents from the top-level
   menu of the Gopher server at gopher.colorado.edu.

4. This RootNode will gather all documents that are in the /home/hardy
   directory, or that are in any subdirectory of /home/hardy.

5. This RootNode will gather only the documents that are in the
   /pub/cs/techreports directory which, in this case, contains some
   bibliographic files rather than the technical reports themselves.

6. This RootNode will gather all documents that are within 1 step of
   the specified RootNode URL, waiting 60 seconds between retrievals.
   This is a good method by which to index your hotlist. By putting an
   HTML file containing ``hotlist'' pointers as this RootNode, this
   enumeration will gather the top-level pages to all of your hotlist
   pointers.

7. This RootNode will gather all documents that are at most 2 steps
   away from the specified RootNode URL.
   Furthermore, it will follow and enumerate any HTTP or FTP URLs that
   it encounters during enumeration.

4.3.4. Gatherer enumeration vs. candidate selection

In addition to using the URL-Filter and Host-Filter files for the
RootNode specification mechanism described in Section ``RootNode
specifications'', you can prevent documents from being indexed through
customizing the stoplist.cf file, described in Section ``Customizing
the type recognition, candidate selection, presentation unnesting, and
summarizing steps''. Since these mechanisms are invoked at different
times, they have different effects.

The URL-Filter and Host-Filter mechanisms are invoked by the
Gatherer's ``RootNode'' enumeration programs. Using these filters as
stop lists can prevent unwanted objects from being retrieved across
the network. This can dramatically reduce gathering time and network
traffic.

The stoplist.cf file is used by the Essence content extraction system
(described in Section ``Extracting data for indexing: The Essence
summarizing subsystem'') after the objects are retrieved, to select
which objects should be content extracted and indexed. This can be
useful because Essence provides a more powerful means of rejecting
indexing candidates, in which you can customize based not only on file
naming conventions but also on file contents (e.g., looking at strings
at the beginning of a file or at UNIX ``magic'' numbers), and also by
more sophisticated file-grouping schemes (e.g., deciding not to
extract contents from object code files for which source code is
available).

As an example of combining these mechanisms, suppose you want to index
the ``.ps'' files linked into your WWW site.
You could do this by having a stoplist.cf file that contains ``HTML'',
and a RootNode URL-Filter that contains:

        Allow \.html
        Allow \.ps
        Deny  .*

As a final note, independent of these customizations the Gatherer
attempts to avoid retrieving objects where possible, by using a local
disk cache of objects, and by using the HTTP ``If-Modified-Since''
request header. The local disk cache is described in Section ``The
local disk cache''.

4.4. Generating LeafNode/RootNode URLs from a program

It is possible to generate RootNode or LeafNode URLs automatically
from program output. This might be useful when gathering a large
number of Usenet newsgroups, for example. The program is specified
inside the RootNode or LeafNode section, preceded by a pipe symbol.

        <LeafNodes>
        |generate-news-urls.sh
        </LeafNodes>

The script must output valid URLs, such as

        news:comp.unix.voodoo
        news:rec.pets.birds
        http://www.nlanr.net/
        ...

In the case of RootNode URLs, enumeration parameters can be given
after the program.

        <RootNodes>
        |my-fave-sites.pl Depth=1 URL=5000,url-filter
        </RootNodes>

4.5. Extracting data for indexing: The Essence summarizing subsystem

After the Gatherer retrieves a document, it passes the document
through a subsystem called Essence to extract indexing information.
Essence allows the Gatherer to collect indexing information easily
from a wide variety of information, using different techniques
depending on the type of data and the needs of the particular corpus
being indexed. In a nutshell, Essence can determine the type of data
pointed to by a URL (e.g., PostScript vs. HTML), ``unravel''
presentation nesting formats (such as compressed ``tar'' files),
select which types of data to index (e.g., don't index Audio files),
and then apply a type-specific extraction algorithm (called a
summarizer) to the data to generate a content summary.
Users can customize each of these aspects, but often this is not
necessary. Harvest is distributed with a ``stock'' set of type
recognizers, presentation unnesters, candidate selectors, and
summarizers that work well for many applications.

Below we describe the stock summarizer set, the current components
distribution, and how users can customize summarizers to change how
they operate and add summarizers for new types of data. If you develop
a summarizer that is likely to be useful to other users, please notify
us via email at lee@arco.de so we may include it in our Harvest
distribution.

Type                    Summarizer Function
--------------------------------------------------------------------
Bibliographic           Extract author and titles
Binary                  Extract meaningful strings and manual page
                        summary
C, CHeader              Extract procedure names, included file names,
                        and comments
Dvi                     Invoke the Text summarizer on extracted ASCII
                        text
FAQ, FullText, README   Extract all words in file
Font                    Extract comments
HTML                    Extract anchors, hypertext links, and selected
                        fields
LaTex                   Parse selected LaTex fields (author, title,
                        etc.)
Mail                    Extract certain header fields
Makefile                Extract comments and target names
ManPage                 Extract synopsis, author, title, etc., based
                        on ``-man'' macros
News                    Extract certain header fields
Object                  Extract symbol table
Patch                   Extract patched file names
Perl                    Extract procedure names and comments
PostScript              Extract text in word processor-specific
                        fashion, and pass through Text summarizer.
RCS, SCCS               Extract revision control summary
RTF                     Up-convert to HTML and pass through HTML
                        summarizer
SGML                    Extract fields named in extraction table
ShellScript             Extract comments
SourceDistribution      Extract full text of README file and comments
                        from Makefile and source code files, and
                        summarize any manual pages
SymbolicLink            Extract file name, owner, and date created
TeX                     Invoke the Text summarizer on extracted ASCII
                        text
Text                    Extract first 100 lines plus first sentence of
                        each remaining paragraph
Troff                   Extract author, title, etc., based on
                        ``-man'', ``-ms'', ``-me'' macro packages, or
                        extract section headers and topic sentences.
Unrecognized            Extract file name, owner, and date created.

4.5.1. Default actions of ``stock'' summarizers

The table in Section ``Extracting data for indexing: The Essence
summarizing subsystem'' provides a brief reference for how documents
are summarized depending on their type. These actions can be
customized, as discussed in Section ``Customizing the type
recognition, candidate selection, presentation unnesting, and
summarizing steps''. Some summarizers are implemented as UNIX programs
while others are expressed as regular expressions; see Section
``Customizing the summarizing step'' or Section ``Example 4'' for more
information about how to write a summarizer.

4.5.2. Summarizing SGML data

It is possible to summarize documents that conform to the Standard
Generalized Markup Language (SGML), for which you have a Document Type
Definition (DTD). The World Wide Web's Hypertext Mark-up Language
(HTML) is actually a particular application of SGML, with a
corresponding DTD. (In fact, the Harvest HTML summarizer can use the
HTML DTD and our SGML summarizing mechanism, which provides various
advantages; see Section ``The SGML-based HTML summarizer''.)
SGML is being used in an increasingly broad variety of applications,
for example as a format for storing data for a number of physical
sciences. Because SGML allows documents to contain a good deal of
structure, Harvest can summarize SGML documents very effectively.

The SGML summarizer (SGML.sum) uses the sgmls program by James Clark to
parse the SGML document. The parser needs both a DTD for the document
and a Declaration file that describes the allowed character set. The
SGML.sum program uses a table that maps SGML tags to SOIF attributes.

4.5.2.1. Location of support files

SGML support files can be found in
$HARVEST_HOME/lib/gatherer/sgmls-lib/. For example, these are the
default pathnames for HTML summarizing using the SGML summarizing
mechanism:

        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
        $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl

The location of the DTD file must be specified in the sgmls catalog
($HARVEST_HOME/lib/gatherer/sgmls-lib/catalog). For example:

        DOCTYPE   HTML   HTML/html.dtd

The SGML.sum program looks for the .decl file in the default location.
An alternate pathname can be specified with the -d option to SGML.sum.

The summarizer looks for the .sum.tbl file first in the Gatherer's lib
directory and then in the default location. Both of these can be
overridden with the -t option to SGML.sum.

4.5.2.2. The SGML to SOIF table

The translation table provides a simple yet powerful way to specify
how an SGML document is to be summarized. There are four ways to map
SGML data into SOIF. The first two are concerned with placing the
content of an SGML tag into a SOIF attribute.

A simple SGML-to-SOIF mapping looks like this:

        <TAG>              soif1,soif2,...
This places the content that occurs inside the tag ``TAG'' into the
SOIF attributes ``soif1'' and ``soif2''. It is possible to select
different SOIF attributes based on SGML attribute values. For example,
if ``ATT'' is an attribute of ``TAG'', then it would be written like
this:

        <TAG,ATT=x>        x-stuff
        <TAG,ATT=y>        y-stuff
        <TAG>              stuff

The second two mappings place values of SGML attributes into SOIF
attributes. To place the value of the ``ATT'' attribute of the ``TAG''
tag into the ``att-stuff'' SOIF attribute you would write:

        <TAG:ATT>          att-stuff

It is also possible to place the value of an SGML attribute into a
SOIF attribute whose name is given by the value of a different SGML
attribute:

        <TAG:ATT1>         $ATT2

When the summarizer encounters an SGML tag not listed in the table,
the content is passed to the parent tag and becomes a part of the
parent's content. To force the content of some tag not to be passed
up, specify the SOIF attribute as ``ignore''. To force the content of
some tag to be passed to the parent in addition to being placed into a
SOIF attribute, list an additional SOIF attribute named ``parent''.

Please see Section ``The SGML-based HTML summarizer'' for examples of
these mappings.

4.5.2.3. Errors and warnings from the SGML Parser

The sgmls parser can generate an overwhelming volume of error and
warning messages. This will be especially true for HTML documents
found on the Internet, which often do not conform to the strict HTML
DTD. By default, errors and warnings are redirected to /dev/null so
that they do not clutter the Gatherer's log files. To enable logging
of these messages, edit the SGML.sum Perl script and set
$syntax_check = 1.

4.5.2.4. Creating a summarizer for a new SGML-tagged data type

To create an SGML summarizer for a new SGML-tagged data type with an
associated DTD, you need to do the following:

1. Write a shell script named FOO.sum which simply contains

        #!/bin/sh
        exec SGML.sum FOO $*

2. Modify the essence configuration files (as described in Section
   ``Customizing the type recognition step'') so that your documents
   get typed as FOO.

3. Create the directory $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/ and
   copy your DTD and Declaration there as FOO.dtd and FOO.decl. Edit
   $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog and add FOO.dtd to it.

4. Create the translation table FOO.sum.tbl and place it with the DTD
   in $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/.

At this point you can test everything from the command line as
follows:

        % FOO.sum myfile.foo

4.5.2.5. The SGML-based HTML summarizer

Harvest can summarize HTML using the generic SGML summarizer described
in Section ``Summarizing SGML data''. The advantage of this approach
is that the summarizer is more easily customizable, and fits with the
well-conceived SGML model (where you define DTDs for individual
document types and build interpretation software to understand DTDs
rather than individual document types). The downside is that the
summarizer is now pickier about syntax, and many Web documents are not
syntactically correct. Because of this pickiness, the default is for
the HTML summarizer to run with syntax checking outputs disabled. If
your documents are so badly formed that they confuse the parser, this
may mean the summarizing process dies unceremoniously. If you find
that some of your HTML documents do not get summarized or only get
summarized in part, you can turn syntax-checking output on by setting
$syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will
allow you to see which documents are invalid and where.

Note that part of the reason for this problem is that Web browsers do
not insist on well-formed documents. So, users can easily create
documents that are not completely valid, yet display fine.
Below is the default SGML-to-SOIF table used by the HTML summarizer:

        HTML ELEMENT    SOIF ATTRIBUTES
        ------------    -----------------------
        <A>             keywords,parent
        <A:HREF>        url-references
        <ADDRESS>       address
        <B>             keywords,parent
        <BODY>          body
        <CITE>          references
        <CODE>          ignore
        <EM>            keywords,parent
        <H1>            headings
        <H2>            headings
        <H3>            headings
        <H4>            headings
        <H5>            headings
        <H6>            headings
        <HEAD>          head
        <I>             keywords,parent
        <IMG:SRC>       images
        <META:CONTENT>  $NAME
        <STRONG>        keywords,parent
        <TITLE>         title
        <TT>            keywords,parent
        <UL>            keywords,parent

The pathname to this file is
$HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. Individual
Gatherers may do customized HTML summarizing by placing a modified
version of this file in the Gatherer lib directory. Another way to
customize is to modify the HTML.sum script and add a -t option to the
SGML.sum command. For example:

        SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML $*

In HTML, the document title is written as:

        <TITLE>My Home Page</TITLE>

The above translation table will place this in the SOIF summary as:

        title{13}:      My Home Page

Note that ``keywords,parent'' occurs frequently in the table. For any
specially marked text (bold, emphasized, hypertext links, etc.), the
words will be copied into the keywords attribute and also left in the
content of the parent element. This keeps the body of the text
readable by not removing certain words.

Any text that appears inside a pair of CODE tags will not show up in
the summary because we specified ``ignore'' as the SOIF attribute.

URLs in HTML anchors are written as:

        <A HREF="http://harvest.cs.colorado.edu/">

The specification for <A:HREF> in the above translation table causes
this to appear as:

        url-references{32}:     http://harvest.cs.colorado.edu/

4.5.2.6. Adding META data to your HTML

One of the most useful HTML tags is META. This allows the document
writer to include arbitrary metadata in an HTML document. A typical
usage of the META element is:

        <META NAME="author" CONTENT="Joe T. Slacker">

By specifying ``<META:CONTENT> $NAME'' in the translation table, this
comes out as:

        author{15}:     Joe T. Slacker

Using the META tags, HTML authors can easily add a list of keywords to
their documents:

        <META NAME="keywords" CONTENT="word1 word2">

4.5.2.7. Other examples

A very terse HTML summarizer could be specified with a table that only
puts emphasized words into the keywords attribute:

        HTML ELEMENT    SOIF ATTRIBUTES
        ------------    -----------------------
        <A>             keywords
        <B>             keywords
        <EM>            keywords
        <H1>            keywords
        <H2>            keywords
        <H3>            keywords
        <I>             keywords
        <META:CONTENT>  $NAME
        <STRONG>        keywords
        <TITLE>         title,keywords
        <TT>            keywords

Conversely, a full-text summarizer can be easily specified with only:

        HTML ELEMENT    SOIF ATTRIBUTES
        ------------    -----------------------
        <HTML>          full-text
        <TITLE>         title,parent

4.5.3. Customizing the type recognition, candidate selection,
presentation unnesting, and summarizing steps

The Harvest Gatherer's actions are defined by a set of configuration
and utility files, and a corresponding set of executable programs
referenced by some of the configuration files.

If you want to customize a Gatherer, you should create bin and lib
subdirectories in the directory where you are running the Gatherer,
and then copy $HARVEST_HOME/lib/gatherer/*.cf and
$HARVEST_HOME/lib/gatherer/magic into your lib directory. Then add to
your Gatherer configuration file:

        Lib-Directory:  lib

The details about what each of these files does are described below.
The basic contents of a typical Gatherer's directory are as follows
(note: some of the file names below can be changed by setting
variables in the Gatherer configuration file, as described in Section
``Setting variables in the Gatherer configuration file''):

        RunGatherd*    bin/     GathName.cf    log.errors      tmp/
        RunGatherer*   data/    lib/           log.gatherer

        bin:
        MyNewType.sum*

        data:
        All-Templates.gz    INFO.soif    PRODUCTION.gdbm    gatherd.log
        INDEX.gdbm          MD5.gdbm     gatherd.cf

        lib:
        bycontent.cf    byurl.cf    quick-sum.cf
        byname.cf       magic       stoplist.cf

        tmp:

The RunGatherd and RunGatherer scripts are used to export the
Gatherer's database after a machine reboot and to run the Gatherer,
respectively. The log.errors and log.gatherer files contain error
messages and the output of the Essence typing step, respectively
(Essence will be described shortly). The GathName.cf file is the
Gatherer's configuration file.
The bin directory contains any summarizers and any other program
needed by the summarizers. If you were to customize the Gatherer by
adding a summarizer, you would place those programs in this bin
directory; the MyNewType.sum is an example.

The data directory contains the Gatherer's database which gatherd
exports. The Gatherer's database consists of the All-Templates.gz,
INDEX.gdbm, INFO.soif, MD5.gdbm and PRODUCTION.gdbm files. The
gatherd.cf file is used to support access control as described in
Section ``Controlling access to the Gatherer's database''. The
gatherd.log file is where the gatherd program logs its information.

The lib directory contains the configuration files used by the
Gatherer's subsystems, namely Essence. These files are described
briefly in the following table:

        bycontent.cf    Content parsing heuristics for type
                        recognition step
        byname.cf       File naming heuristics for type recognition
                        step
        byurl.cf        URL naming heuristics for type recognition
                        step
        magic           UNIX ``file'' command specifications (matched
                        against bycontent.cf strings)
        quick-sum.cf    Extracts attributes for summarizing step.
        stoplist.cf     File types to reject during candidate
                        selection

4.5.3.1. Customizing the type recognition step

Essence recognizes types in three ways (in order of precedence): by
URL naming heuristics, by file naming heuristics, and by locating
identifying data within a file using the UNIX file command.

To modify the type recognition step, edit lib/byname.cf to add file
naming heuristics, or lib/byurl.cf to add URL naming heuristics, or
lib/bycontent.cf to add by-content heuristics. The by-content
heuristics match the output of the UNIX file command, so you may also
need to edit the lib/magic file.
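
The order of precedence described above can be sketched as follows.
This is only an illustration, not the Essence implementation: the rule
lists are hypothetical in-memory stand-ins for byurl.cf, byname.cf, and
bycontent.cf (a type name paired with a regular expression), the
function name recognize_type is made up, and the use of an unanchored
regular expression search is an assumption.

```python
import re


def recognize_type(url, filename, file_output, byurl, byname, bycontent):
    """Return the first matching type, trying URL, name, then content rules.

    Each rule list holds (type_name, regex) pairs. `file_output` is the
    text printed by the UNIX file command for the object.
    """
    for rules, subject in ((byurl, url), (byname, filename), (bycontent, file_output)):
        for type_name, pattern in rules:
            if re.search(pattern, subject):
                return type_name
    return "Unrecognized"


# Hypothetical rules, for illustration only.
byurl = [("HTML", r"^http://.*/$")]
byname = [("PostScript", r"\.ps$"), ("HTML", r"\.html?$")]
bycontent = [("PostScript", r"PostScript")]

print(recognize_type("http://www.example.org/", "", "", byurl, byname, bycontent))
print(recognize_type("ftp://ftp.example.org/pub/paper", "paper",
                     "PostScript document text", byurl, byname, bycontent))
```

Note that only the last of the three checks needs the file contents,
which is why the URL and file naming heuristics can be applied before
an object is retrieved.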
See Section ``Example 3'' and ``Example 4'' for detailed examples on
how to customize the type recognition step.

4.5.3.2. Customizing the candidate selection step

The lib/stoplist.cf configuration file contains a list of types that
are rejected by Essence. You can add or delete types from
lib/stoplist.cf to control the candidate selection step.

To direct Essence to index only certain types, you can list the types
to index in lib/allowlist.cf. Then, supply Essence with the
--allowlist flag.

The file and URL naming heuristics used by the type recognition step
(described in Section ``Customizing the type recognition step'') are
particularly useful for candidate selection when gathering remote
data. They allow the Gatherer to avoid retrieving files that you don't
want to index (in contrast, recognizing types by locating identifying
data within a file requires that the file be retrieved first). This
approach can save quite a bit of network traffic, particularly when
used in combination with enumerated RootNode URLs. For example, many
sites provide each of their files in both a compressed and
uncompressed form. By building a lib/allowlist.cf containing only the
Compressed types, you can avoid retrieving the uncompressed versions
of the files.

4.5.3.3. Customizing the presentation unnesting step

Some types are declared as ``nested'' types. Essence treats these
differently from other types, by running a presentation unnesting
algorithm or ``Exploder'' on the data rather than a Summarizer. At
present Essence can handle files nested in the following formats:

1. binhex

2. uuencode

3. shell archive (``shar'')

4. tape archive (``tar'')

5. bzip2 compressed (``bzip2'')

6. compressed

7. GNU compressed (``gzip'')

8. zip compressed archive

To customize the presentation unnesting step you can modify the
Essence source file src/gatherer/essence/unnest.c. This file lists the
available presentation encodings, and also specifies the unnesting
algorithm. Typically, an external program is used to unravel a file
into one or more component files (e.g. bzip2, gunzip, uudecode, and
tar).

An Exploder may also be used to explode a file into a stream of SOIF
objects. An Exploder program takes a URL as its first command-line
argument and a file containing the data to use as its second, and then
generates one or more SOIF objects as output. For your convenience,
the Exploder type is already defined as a nested type. To save some
time, you can use this type and its corresponding Exploder.unnest
program rather than modifying the Essence code.

See Section ``Example 2'' for a detailed example on writing an
Exploder. The unnest.c file also contains further information on
defining the unnesting algorithms.

4.5.3.4. Customizing the summarizing step

Essence supports two mechanisms for defining the type-specific
extraction algorithms (called Summarizers) that generate content
summaries: a UNIX program that takes as its only command line argument
the filename of the data to summarize, and line-based regular
expressions specified in lib/quick-sum.cf. See Section ``Example 4''
for detailed examples on how to define both types of Summarizers.

The UNIX Summarizers are named using the convention TypeName.sum
(e.g., PostScript.sum). These Summarizers output their content summary
in a SOIF attribute-value list (see Section ``The Summary Object
Interchange Format (SOIF)''). You can use the wrapit command to wrap
raw output into the SOIF format (i.e., to provide byte-count
delimiters on the individual attribute-value pairs).
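
As a rough picture of what a program-style Summarizer does, the sketch
below reads the file named by its only command-line argument and
prints attribute-value pairs with byte counts. It is a hypothetical
example, not a stock Harvest summarizer: the type MyNewType, the
attribute names title and comments, and the extraction rules are
invented for illustration. The exact delimiter (a tab after the colon)
is an assumption here; the SOIF section of this manual is the
authority on the format, and wrapit remains the supported way to add
the byte counts.

```python
import sys


def soif_attribute(name, value):
    """Format one attribute-value pair with a byte-count delimiter."""
    size = len(value.encode("utf-8"))  # count bytes, not characters
    return "%s{%d}:\t%s\n" % (name, size, value)


def summarize(text):
    """Return (attribute, value) pairs for a hypothetical MyNewType file.

    The first non-empty line becomes the title; lines starting with '#'
    are collected as comments.
    """
    lines = text.splitlines()
    title = next((line.strip() for line in lines if line.strip()), "")
    comments = " ".join(line[1:].strip() for line in lines if line.startswith("#"))
    pairs = [("title", title)]
    if comments:
        pairs.append(("comments", comments))
    return pairs


def main(argv):
    # A Summarizer receives the file to summarize as its only argument.
    with open(argv[1], encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    for name, value in summarize(text):
        sys.stdout.write(soif_attribute(name, value))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

Following the naming convention above, such a program would be
installed as bin/MyNewType.sum in the Gatherer's directory.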
There is a summarizer called FullText.sum that you can use to perform full text indexing of selected file types, by simply setting up the lib/bycontent.cf and lib/byname.cf configuration files to recognize the desired file types as FullText (i.e., using ``FullText'' in column 1 next to the matching regular expression).

4.6.  Post-Summarizing: Rule-based tuning of object summaries

It is possible to ``fine-tune'' the summary information generated by the Essence summarizers. A typical application of this would be to change the Time-to-Live attribute based on some knowledge about the objects. So an administrator could use the post-summarizing feature to give quickly-changing objects a lower TTL, and very stable documents a higher TTL.

Objects are selected for post-summarizing if they meet a specified condition. A condition consists of three parts: an attribute name, an operation, and some string data. For example:

    city == 'New York'

In this case we are checking if the city attribute is equal to the string `New York'. For exact string matching, the string data must be enclosed in single quotes. Regular expressions are also supported:

    city ~ /New York/

Negative operators are also supported:

    city != 'New York'
    city !~ /New York/

Conditions can be joined with `&&' (logical and) or `||' (logical or) operators:

    city == 'New York' && state != 'NY';

When all conditions are met for an object, some number of instructions are executed on it. There are four types of instructions which can be specified:

1. Set an attribute exactly to some specific string. Example:

       time-to-live = "86400"

2. Filter an attribute through some program. The attribute value is given as input to the filter. The output of the filter becomes the new attribute value. Example:

       keywords | tr A-Z a-z

3. Filter multiple attributes through some program. In this case the filter must read and write attributes in the SOIF format. Example:

       address,city,state,zip ! cleanup-address.pl

4. A special case instruction is to delete an object. To do this, simply write:

       delete()

4.6.1.  The Rules file

The conditions and instructions are combined together in a ``rules'' file. The format of this file is somewhat similar to a Makefile; conditions begin in the first column and instructions are indented by a tab-stop. Example:

    type == 'HTML'
            partial-text | cleanup-html-text.pl

    URL ~ /users/
            time-to-live = "86400"
            partial-text ! extract-owner.sh

    type == 'SOIFStream'
            delete()

This rules file is specified in the gatherer.cf file with the Post-Summarizing tag, e.g.:

    Post-Summarizing: lib/myrules

4.6.2.  Rewriting URLs

Until version 1.4 it was not possible to rewrite the URL-part of an object summary. It is now possible, but only by using the ``pipe'' instruction. This may be useful for people wanting to run a Gatherer on file:// URLs, but have them appear as http:// URLs. This can be done with a post-summarizing rule such as:

    url ~ 'file://localhost/web/htdocs/'
            url | fix-url.pl

And the 'fix-url.pl' script might look like:

    #!/usr/local/bin/perl -p
    s'file://localhost/web/htdocs/'http://www.my.domain/';

4.7.  Gatherer administration

4.7.1.  Setting variables in the Gatherer configuration file

In addition to customizing the steps described in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps'', you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information.
Section ``Basic setup'' shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.

Each variable name starts in the first column, ends with a colon, then is followed by the value. The following table shows the supported variables:

    Access-Delay:       Default delay between URL accesses.
    Data-Directory:     Directory where GDBM database is written.
    Debug-Options:      Debugging options passed to child programs.
    Errorlog-File:      File for logging errors.
    Essence-Options:    Any extra options to pass to Essence.
    FTP-Auth:           Username/password for protected FTP documents.
    Gatherd-Inetd:      Denotes that gatherd is run from inetd.
    Gatherer-Host:      Full hostname where the Gatherer is run.
    Gatherer-Name:      A unique name for the Gatherer.
    Gatherer-Options:   Extra options for the Gatherer.
    Gatherer-Port:      Port number for gatherd.
    Gatherer-Version:   Version string for the Gatherer.
    HTTP-Basic-Auth:    Username/password for protected HTTP documents.
    HTTP-Proxy:         host:port of your HTTP proxy.
    Keep-Cache:         ``yes'' to not remove local disk cache.
    Lib-Directory:      Directory where configuration files live.
    Local-Mapping:      Mapping information for local gathering.
    Log-File:           File for logging progress.
    Post-Summarizing:   A rules-file for post-summarizing.
    Refresh-Rate:       Object refresh-rate in seconds, default 1 week.
    Time-To-Live:       Object time-to-live in seconds, default 1 month.
    Top-Directory:      Top-level directory for the Gatherer.
    Working-Directory:  Directory for tmp files and local disk cache.

Notes:

o  We recommend that you use the Top-Directory variable, since it will set the Data-Directory, Lib-Directory, and Working-Directory variables.

o  Both Working-Directory and Data-Directory will have files in them after the Gatherer has run. The Working-Directory will hold the local disk cache that the Gatherer uses to reduce network I/O, and the Data-Directory will hold the GDBM databases that contain the content summaries.

o  You should use full rather than relative pathnames.

o  All variable definitions must come before the RootNode or LeafNode URLs.

o  Any line that starts with a ``#'' is a comment.

o  Local-Mapping is discussed in Section ``Local file system gathering for reduced CPU load''.

o  HTTP-Proxy will retrieve HTTP URLs via a proxy host. The syntax is hostname:port; for example, proxy.yoursite.com:3128.

o  Essence-Options is particularly useful, as it lets you customize basic aspects of the Gatherer easily.

o  The only valid Gatherer-Options is --save-space which directs the Gatherer to be more space efficient when preparing its database for export.

o  The Gatherer program will accept the -background flag which will cause the Gatherer to run in the background.

The Essence options are:

    Option                  Meaning
    --------------------------------------------------------------------
    --allowlist filename    File with list of types to allow
    --fake-md5s             Generates MD5s for SOIF objects from a .unnest program
    --fast-summarizing      Trade some consistency for speed. Use only when
                            an external summarizer is known to generate clean,
                            unique attributes.
    --full-text             Use entire file instead of summarizing. Alternatively,
                            you can perform full text indexing of individual file
                            types by using the FullText.sum summarizer.
    --max-deletions n       Number of GDBM deletions before reorganization
    --minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
    --no-access             Do not read contents of objects
    --no-keywords           Do not automatically generate keywords
    --stoplist filename     File with list of types to remove
    --type-only             Only type data; do not summarize objects

A particular note about full text summarizing: Using the Essence --full-text option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results (e.g., it will directly include the PostScript for a document rather than first passing the data through a PostScript to text extractor, providing few searchable terms and large SOIF objects). Using the individual file type summarizing mechanism described in Section ``Customizing the summarizing step'' will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

4.7.2.  Local file system gathering for reduced CPU load

Although the Gatherer's work load is specified using URLs, often the files being gathered are located on a local file system. In this case it is much more efficient to gather directly from the local file system than via FTP/Gopher/HTTP/News, primarily because of all the UNIX forking required to gather information via these network processes. For example, our measurements indicate that gathering via FTP causes 4 to 7 times more CPU load than gathering directly from the local file system. For large collections (e.g., archive sites containing many thousands of files), the CPU savings can be considerable.
Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to translate URLs to local file system names, using the Local-Mapping Gatherer configuration file variable (see Section ``Setting variables in the Gatherer configuration file''). The syntax is:

    Local-Mapping: URL_prefix local_path_prefix

This causes all URLs starting with URL_prefix to be translated to files starting with the prefix local_path_prefix while gathering, but to be left as URLs in the results of queries (so the objects can be retrieved as usual). Note that no regular expressions are supported here. As an example, the specification

    Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/
    Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/

would cause the URL http://harvest.cs.colorado.edu/~hardy/Home.html to be translated to the local file name /homes/hardy/public_html/Home.html, while the URL ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z would be translated to the local file name /cs/ftp/techreports/schwartz/Harvest.Conf.ps.Z.

Local gathering will work over NFS file systems. A local mapping will fail if: the local file cannot be opened for reading; or the local file is not a regular file; or the local file has execute bits set. So, for directories, symbolic links and CGI scripts, the server is always contacted rather than the local file system. Lastly, the Gatherer does not perform any URL syntax translations for local mappings. If your URL has characters that should be escaped (as in RFC1738), then the local mapping will fail. Starting with version 1.4 patchlevel 2 Essence will print [L] after URLs which were successfully accessed locally.
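The translation and the failure conditions described above can be sketched as follows. This is an illustration in Python, not the Gatherer's actual code: the function name is hypothetical, and two details are assumptions not stated in this manual, namely that the first matching prefix is the one used and that a failed check on that match ends the lookup rather than trying later mappings.

```python
import os
import stat


def map_url_to_local(url, mappings):
    """Return the local path for url, or None if remote access is needed.

    mappings is a list of (URL_prefix, local_path_prefix) pairs, as given
    on Local-Mapping lines. Matching is a plain prefix comparison (no
    regular expressions) and the URL is used as-is (no unescaping).
    """
    for url_prefix, path_prefix in mappings:
        if not url.startswith(url_prefix):
            continue
        path = path_prefix + url[len(url_prefix):]
        try:
            info = os.lstat(path)   # lstat: do not follow symbolic links
        except OSError:
            return None             # no such file
        if not stat.S_ISREG(info.st_mode):
            return None             # directory, symbolic link, etc.
        if info.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
            return None             # execute bits set (e.g. a CGI script)
        if not os.access(path, os.R_OK):
            return None             # cannot be opened for reading
        return path
    return None                     # no Local-Mapping applies


if __name__ == "__main__":
    example = [("http://harvest.cs.colorado.edu/~hardy/",
                "/homes/hardy/public_html/")]
    print(map_url_to_local(
        "http://harvest.cs.colorado.edu/~hardy/Home.html", example))
```

A result of None corresponds to the case where the Gatherer falls back to contacting the server.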
Note that if your network is highly congested, it may actually be faster to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very inefficient in highly congested situations. Even better would be to run local Gatherers on the hosts where the disks reside, and access them directly via the local file system.

4.7.3.  Gathering from password-protected servers

You can gather password-protected documents from HTTP and FTP servers. In both cases, you can specify a username and password as a part of the URL. The format is as follows:

    ftp://user:password@host:port/url-path
    http://user:password@host:port/url-path

With this format, the ``user:password'' part is kept as a part of the URL string throughout Harvest. This may enable anyone who uses your Broker(s) to access password-protected documents.

You can keep the username and password information ``hidden'' by specifying the authentication information in the Gatherer configuration file. For HTTP, the format is as follows:

    HTTP-Basic-Auth: realm username password

where realm is the same as the AuthName parameter given in an Apache httpd httpd.conf or .htaccess file. In other HTTP server configurations, the realm value is sometimes called ServerId.

For FTP, the format in the gatherer.cf file is

    FTP-Auth: hostname[:port] username password

4.7.4.  Controlling access to the Gatherer's database

You can use the gatherd.cf file (placed in the Data-Directory of a Gatherer) to control access to the Gatherer's database. A line that begins with Allow is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word all is used, then all hosts are matched. Deny is the opposite of Allow.
The following example will only allow hosts in the cs.colorado.edu or usc.edu domains to access the Gatherer's database:

    Allow  cs.colorado.edu usc.edu
    Deny   all

4.7.5.  Periodic gathering and realtime updates

The Gatherer program does not automatically do any periodic updates -- when you run it, it processes the specified URLs, starts up a gatherd daemon (if one isn't already running), and then exits. If you want to update the data periodically (e.g., to capture new files as they are added to an FTP archive), you need to use the UNIX cron command to run the Gatherer program at some regular interval.

To set up periodic gathering via cron, use the RunGatherer command that RunHarvest will create. An example RunGatherer script follows:

    #!/bin/sh
    #
    #  RunGatherer - Runs the ATT 800 Gatherer (from cron)
    #
    HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
    PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
    export PATH
    NNTPSERVER=localhost; export NNTPSERVER
    cd /usr/local/harvest/gatherers/att800
    exec Gatherer "att800.cf"

You should run the RunGatherd command from your system startup file (e.g. /etc/rc.local), so the Gatherer's database is exported each time the machine reboots. An example RunGatherd script follows:

    #!/bin/sh
    #
    #  RunGatherd - starts up the gatherd process (from /etc/rc.local)
    #
    HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
    PATH=${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/bin:$PATH; export PATH
    exec gatherd -d /usr/local/harvest/gatherers/att800/data 8500

4.7.6.  The local disk cache

The Gatherer maintains a local disk cache of files it gathers to reduce network traffic from restarting aborted gathering attempts. However, since the remote server must still be contacted whenever the Gatherer runs, please do not set your cron job to run the Gatherer frequently.
A typical value might be weekly or monthly, depending on how congested the network is and how important it is to have the most current data.

By default, the Gatherer's local disk cache is deleted after each successful completion. To save the local disk cache between Gatherer sessions, define Keep-Cache: yes in your Gatherer configuration file (Section ``Setting variables in the Gatherer configuration file'').

If you want your Broker's index to reflect new data, then you must run the Gatherer and run a Broker collection. By default, a Broker will perform collections once a day. If you want the Broker to collect data as soon as it's gathered, then you will need to coordinate the timing of the completion of the Gatherer and the Broker collections.

If you run your Gatherer frequently and you set Keep-Cache: yes in your Gatherer configuration file, then the Gatherer's local disk cache may interfere with retrieving updates. By default, objects in the local disk cache expire after 7 days; however, you can expire objects more quickly by setting the $GATHERER_CACHE_TTL environment variable to the number of seconds for the Time-To-Live (TTL) before you run the Gatherer, or you can change RunGatherer to remove the Gatherer's tmp directory after each Gatherer run. For example, to expire objects in the local disk cache after one day:

    % setenv GATHERER_CACHE_TTL 86400       # one day
    % ./RunGatherer

The Gatherer's local disk cache size defaults to 32 MB, but you can change this value by setting the $HARVEST_MAX_LOCAL_CACHE environment variable to the number of MB before you run the Gatherer.
For example, to have a maximum cache of 10 MB you can do as follows:

    % setenv HARVEST_MAX_LOCAL_CACHE 10     # 10 MB
    % ./RunGatherer

If you have access to the software that creates the files that you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify this software to schedule realtime Gatherer updates whenever a file is created or updated. For example, if all users update the files being indexed using a particular program, this program could be modified to run the Gatherer upon completion of the user's update.

Note that, when used in conjunction with cron, the Gatherer provides a powerful data ``mirroring'' facility. You can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes.

4.7.7.  Incorporating manually generated information into a Gatherer

You may want to inspect the quality of the automatically-generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section ``Customizing the summarizing step''). Sometimes, however, it makes sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add Title attributes to the content summaries for a set of PostScript documents (since it's difficult to parse them out of PostScript automatically).

Harvest provides some programs that automatically clean up a Gatherer's database.
The rmbinary program removes any binary data from the templates. The cleandb program does some simple validation of SOIF objects, and when given the -truncate flag it will truncate the Keywords data field to 8 kilobytes. To help in manually managing the Gatherer's databases, the gdbmutil GDBM database management tool is provided in $HARVEST_HOME/lib/gatherer.

In a future release of Harvest we will provide a forms-based mechanism to make it easy to provide manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the mktemplate, template2db, mergedb, and mkindex programs. You first need to create a file (called, say, annotations) in the following format:

    @FILE { url1
    Attribute-Name-1:       DATA
    Attribute-Name-2:       DATA
    ...
    Attribute-Name-n:       DATA
    }

    @FILE { url2
    Attribute-Name-1:       DATA
    Attribute-Name-2:       DATA
    ...
    Attribute-Name-n:       DATA
    }

    ...

Note that the Attributes must begin in column 0 and have one tab after the colon, and the DATA must be on a single line.

Next, run the mktemplate and template2db programs to generate SOIF and then GDBM versions of these data (you can have several files containing the annotations, and generate a single GDBM database from the above commands):

    % set path = ($HARVEST_HOME/lib/gatherer $path)
    % mktemplate annotations [annotations2 ...] | template2db annotations.gdbm

Finally, you run mergedb to incorporate the annotations into the automatically generated data, and mkindex to generate an index for it. The usage line for mergedb is:

    mergedb production automatic manual [manual ...]

The idea is that production is the final GDBM database that the Gatherer will serve. This is a new database that will be generated from the other databases on the command line.
automatic is the GDBM database that a Gatherer automatically generated in a previous run (e.g., WORKING.gdbm or a previous PRODUCTION.gdbm). manual and so on are the GDBM databases that you manually created. When mergedb runs, it builds the production database by first copying the templates from the manual databases, and then merging in the attributes from the automatic database. In case of a conflict (the same attribute with different values in the manual and automatic databases), the manual values override the automatic values.

By keeping the automatically and manually generated data stored separately, you can avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script to remerge the manual annotations with the automatically gathered data after each gathering.

An example use of mergedb is:

    % mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
    % mv PRODUCTION.new PRODUCTION.gdbm
    % mkindex

If the manual database looked like this:

    @FILE { url1
    my-manual-attribute:  this is a neat attribute
    }

and the automatic database looked like this:

    @FILE { url1
    keywords:  boulder colorado
    file-size: 1034
    md5:       c3d79dc037efd538ce50464089af2fb6
    }

then in the end, the production database will look like this:

    @FILE { url1
    my-manual-attribute:  this is a neat attribute
    keywords:  boulder colorado
    file-size: 1034
    md5:       c3d79dc037efd538ce50464089af2fb6
    }

4.8.  Troubleshooting

Debugging
    Extra information from specific programs and library routines can be logged by setting debugging flags. A debugging flag has the form -Dsection,level. Section is an integer in the range 1-255, and level is an integer in the range 1-9.
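The flag format just described can be checked with a few lines of Python before the flags are placed in a configuration file or environment variable. This is only a sketch: the function name is hypothetical, and whether the Harvest programs themselves reject out-of-range values is not stated here, so the range checks simply mirror the limits given above.

```python
import re

# One flag: -D, a section number, a comma, a level number.
_FLAG = re.compile(r"-D(\d+),(\d+)")


def parse_debug_flags(text):
    """Parse a string such as '-D68,5 -D44,1' into (section, level) pairs."""
    flags = []
    for word in text.split():
        match = _FLAG.fullmatch(word)
        if match is None:
            raise ValueError("not a debugging flag: %r" % word)
        section, level = int(match.group(1)), int(match.group(2))
        if not 1 <= section <= 255:
            raise ValueError("section out of range 1-255: %d" % section)
        if not 1 <= level <= 9:
            raise ValueError("level out of range 1-9: %d" % level)
        flags.append((section, level))
    return flags


if __name__ == "__main__":
    print(parse_debug_flags("-D68,5 -D44,1"))
```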
    Debugging flags can be given on a command line, with the Debug-Options: tag in a gatherer configuration file, or by setting the environment variable $HARVEST_DEBUG. Examples:

        Debug-Options: -D68,5 -D44,1
        % httpenum -D20,1 -D21,1 -D42,1 http://harvest.cs.colorado.edu/
        % setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1'

    Debugging sections and levels have been assigned to the following sections of the code:

        section  20, level 1, 5, 9   Common liburl URL processing
        section  21, level 1, 5, 9   Common liburl HTTP routines
        section  22, level 1, 5      Common liburl disk cache routines
        section  23, level 1         Common liburl FTP routines
        section  24, level 1         Common liburl Gopher routines
        section  25, level 1         urlget - standalone liburl program.
        section  26, level 1         ftpget - standalone liburl program.
        section  40, level 1, 5, 9   Gatherer URL enumeration
        section  41, level 1         Gatherer enumeration URL verification
        section  42, level 1, 5, 9   Gatherer enumeration for HTTP
        section  43, level 1, 5, 9   Gatherer enumeration for Gopher
        section  44, level 1, 5      Gatherer enumeration filter routines
        section  45, level 1         Gatherer enumeration for FTP
        section  46, level 1         Gatherer enumeration for file:// URLs
        section  48, level 1, 5      Gatherer enumeration robots.txt stuff
        section  60, level 1         Gatherer essence data object processing
        section  61, level 1         Gatherer essence database routines
        section  62, level 1         Gatherer essence main
        section  63, level 1         Gatherer essence type recognition
        section  64, level 1         Gatherer essence object summarizing
        section  65, level 1         Gatherer essence object unnesting
        section  66, level 1, 2, 5   Gatherer essence post-summarizing
        section  67, level 1         Gatherer essence object-ID code
        section  69, level 1, 5, 9   Common SOIF template processing
        section  70, level 1, 5, 9   Broker registry
        section  71, level 1         Broker collection routines
        section  72, level 1         Broker SOIF parsing routines
        section  73, level 1, 5, 9   Broker registry hash tables
        section  74, level 1         Broker storage manager routines
        section  75, level 1, 5      Broker query manager routines
        section  75, level 4         Broker query_list debugging
        section  76, level 1         Broker event management routines
        section  77, level 1         Broker main
        section  78, level 9         Broker select(2) loop
        section  79, level 1, 5, 9   Broker gatherer-id management
        section  80, level 1         Common utilities memory management
        section  81, level 1         Common utilities buffer routines
        section  82, level 1         Common utilities system(3) routines
        section  83, level 1         Common utilities pathname routines
        section  84, level 1         Common utilities hostname processing
        section  85, level 1         Common utilities string processing
        section  86, level 1         Common utilities DNS host cache
        section 101, level 1         Broker PLWeb indexing engine
        section 102, level 1, 2, 5   Broker Glimpse indexing engine
        section 103, level 1         Broker Swish indexing engine

Symptom
    The Gatherer doesn't pick up all the objects pointed to by some of my RootNodes.

Solution
    The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See Section ``RootNode specifications'' for details on how to override these limits.

Symptom
    Local-Mapping did not work for me - it retrieved the objects via the usual remote access protocols.

Solution
    A local mapping will fail if:

    o  the local filename cannot be opened for reading; or,

    o  the local filename is not a regular file; or,

    o  the local filename has execute bits set.

    So for directories, symbolic links, and CGI scripts, the HTTP server is always contacted. We don't perform URL translation for local mappings. If your URLs have characters that must be escaped, then the local mapping will also fail. Add debug option -D20,1 to understand how local mappings are taking place.

Symptom
    Using the --full-text option I see a lot of raw data in the content summaries, with few keywords I can search.
Solution
    At present --full-text simply includes the full data content in the SOIF summaries. Using the individual file type summarizing mechanism described in Section ``Customizing the summarizing step'' will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

Symptom
    No indexing terms are being generated in the SOIF summary for the META tags in my HTML documents.

Solution
    This probably indicates that your HTML is not syntactically well-formed, and hence the SGML-based HTML summarizer is not able to recognize it. See Section ``Summarizing SGML data'' for details and debugging options.

Symptom
    Gathered data are not being updated.

Solution
    The Gatherer does not automatically do periodic updates. See Section ``Periodic gathering and realtime updates'' for details.

Symptom
    The Gatherer puts slightly different URLs in the SOIF summaries than I specified in the Gatherer configuration file.

Solution
    This happens because the Gatherer attempts to put URLs into a canonical format. It does this by removing default port numbers and making similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes the standard stoplist.cf types, which include HTTP-Query (the cgi-bin stuff).

Symptom
    There are no Last-Modification-Time or MD5 attributes in my gathered SOIF data, so the Broker can't do duplicate elimination.

Solution
    If you gather remote, manually-created information, it is pulled into Harvest using ``exploders'' that translate from the remote format into SOIF.
    That means they don't have a direct way to fill in the Last-Modification-Time or MD5 information per record. Note also that this will mean one update to the remote records would cause all records to look updated, which will result in more network load for Brokers that collect from this Gatherer's data. As a solution, you can compute MD5s for all objects, and store them as part of the record. Then, when you run the exploder you only generate timestamps for the ones for which the MD5s changed - giving you real last-modification times.

Symptom
    The Gatherer substitutes a ``%7e'' for a ``~'' in all the user directory URLs.

Solution
    The Gatherer conforms to RFC1738, which says that a tilde inside a URL should be encoded as ``%7e'', because it is considered an ``unsafe'' character.

Symptom
    When I search using keywords I know are in a document I have indexed with Harvest, the document isn't found.

Solution
    Harvest uses a content extraction subsystem called Essence that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''. Or, you can tell Essence to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section ``Setting variables in the Gatherer configuration file''.

Symptom
    I'm running Harvest on HP-UX, but the essence process in the Gatherer takes too much memory.

Solution
    The supplied regular expression library has memory leaks on HP-UX, so you need to use the regular expression library supplied with HP-UX.
    Change the Makefile in src/gatherer/essence to read:

        REGEX_DEFINE    = -DUSE_POSIX_REGEX
        REGEX_INCLUDE   =
        REGEX_OBJ       =
        REGEX_TYPE      = posix

Symptom
    I built the configuration files to customize how Essence types data and extracts content, but it uses the standard typing/extracting mechanisms anyway.

Solution
    Verify that you have the Lib-Directory set to the lib/ directory in which you put your configuration files. Lib-Directory is defined in your Gatherer configuration file.

Symptom
    I am having problems resolving host names on SunOS.

Solution
    In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either:

    o  the hostname you gave does not really exist; or

    o  your system is not configured to use the DNS.

    To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup command.

    Some sites may use Sun Microsystems' Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (the names of which you can determine from the ypwhich command) must be configured to query DNS servers for hostnames they do not know about. See the -b option of the ypxfr command.

Symptom
    I cannot get the Gatherer to work across our firewall gateway.

Solution
    Harvest only supports retrieving HTTP objects through a proxy. It is not yet possible to request Gopher and FTP objects through a firewall.
    For these objects, you may need to run Harvest internally (behind the firewall) or on the firewall host itself.

    If you see the ``Host is unreachable'' message, these are the likely problems:

    o  your connection to the Internet is temporarily down due to a circuit or routing failure; or

    o  you are behind a firewall.

    If you see the ``Connection refused'' message, the likely problem is that you are trying to connect to an unused port on the destination machine. In other words, there is no program listening for connections on that port.

    The Harvest gatherer is essentially a WWW client. You should expect it to work the same as any Web browser.

5. The Broker

5.1. Overview

The Broker retrieves and manages indexing information from Gatherers and other Brokers, and provides a WWW query interface to the indexing information.

5.2. Basic setup

The Broker is automatically started by the RunHarvest command. Other relevant commands are described in Section ``Starting up the system: RunHarvest and related commands''.

In the current section we discuss various ways users can customize and tune the Broker, how to administer the Broker, and the various Broker programming interfaces.

As suggested in Figure ``1'', the Broker uses a flexible indexing interface that supports a variety of indexing subsystems. The default Harvest Broker uses Glimpse as indexer, but other indexers such as Swish and WAIS (both freeWAIS and commercial WAIS <ftp://ftp.cnidr.org/pub/software/freewais/>) also work with the Broker (see Section ``Using different index/search engines with the Broker'').

To create a new Broker, run the CreateBroker program. It will ask you a series of questions about how you'd like to configure your Broker, and then automatically create and configure it. To start your Broker, use the RunBroker program that CreateBroker generates. The Broker should be started when your system reboots.
To prevent a collection while starting the Broker, use the -nocol option.

There are a number of ways you can customize or tune the Broker, discussed in Sections ``Tuning Glimpse indexing in the Broker'' and ``Using different index/search engines with the Broker''. You may also use the RunHarvest command, discussed in Section ``Starting up the system: RunHarvest and related commands'', to create both a Broker and a Gatherer.

5.3. Querying a Broker

The Harvest Broker can handle many types of queries. The queries handled by a particular Broker depend on what index/search engine is being used inside of it (e.g., WAIS does not support some of the queries that Glimpse does). In this section we describe the full syntax. If a particular Broker does not support a certain type of query, it will return an error when the user requests that type of query.

The simplest query is a single keyword, such as:

    lightbulb

Searching for common words (like ``computer'' or ``html'') may take a lot of time.

Particularly for large Brokers, it is often helpful to use more powerful queries. Harvest supports many different index/search engines, with varying capabilities. At present, our most powerful (and commonly used) search engine is Glimpse, which supports:

o  case-insensitive and case-sensitive queries;

o  matching parts of words, whole words, or multiple word phrases (like ``resource discovery'');

o  Boolean (AND/OR) combinations of keywords;

o  approximate matches (e.g., allowing spelling errors);

o  structured queries (which allow you to constrain matches to certain attributes);

o  displaying matched lines or entire matching records (e.g., for citations);

o  specifying limits on the number of matches returned; and

o  a limited form of regular expressions (e.g., allowing ``wild card'' expressions that match all words ending in a particular suffix).

The different types of queries (and how to use them) are discussed below.
Note that you use the same syntax regardless of what index/search engine is running in a particular Broker, but that not all engines support all of the above features. In particular, some of the Brokers use WAIS, which sometimes searches faster than Glimpse but supports only Boolean keyword queries and the ability to specify result set limits.

The different options - case-sensitivity, approximate matching, the ability to show matched lines vs. entire matching records, and the ability to specify match count limits - can all be specified with buttons and menus in the Broker query forms.

A structured query has the form:

    tag-name : value

where tag-name is a Content Summary attribute name, and value is the search value within the attribute. If you click on a Content Summary, you will see what attributes are available for a particular Broker. A list of common attributes is shown in Section ``List of common SOIF attribute names''.

Keyword searches and structured queries can be combined using Boolean operators (AND and OR) to form complex queries. In the absence of parentheses, Boolean operators are evaluated from left to right. For multiple word phrases or regular expressions, you need to enclose the string in double quotes, e.g.,

    "internet resource discovery"

or

    "discov.*"

Double quotes should also be used when searching for non-alphanumeric characters.

5.3.1. Example queries

Simple keyword search query: Arizona
    This query returns all objects in the Broker containing the word Arizona.

Boolean query: Arizona AND desert
    This query returns all objects in the Broker that contain both words anywhere in the object in any order.

Phrase query: "Arizona desert"
    This query returns all objects in the Broker that contain Arizona desert as a phrase. Notice that you need to put double quotes around the phrase.
Boolean queries with phrases: "Arizona desert" AND windsurfing
    This query returns all objects in the Broker that contain Arizona desert as a phrase and the word windsurfing.

Simple Structured query: Title : windsurfing
    This query returns all objects in the Broker where the Title attribute contains the value windsurfing.

Complex query: "Arizona desert" AND (Title : windsurfing)
    This query returns all objects in the Broker that contain the phrase Arizona desert and where the Title attribute of the same object contains the value windsurfing.

5.3.2. Regular expressions

Some types of regular expressions are supported by Glimpse. A regular expression search can be much slower than other searches. The following is a partial list of possible patterns. (For more details see the Glimpse documentation.)

o  ^joe will match ``joe'' at the beginning of a line.

o  joe$ will match ``joe'' at the end of a line.

o  [a-ho-z] matches any character between ``a'' and ``h'' or between ``o'' and ``z''.

o  . matches any single character except newline.

o  c* matches zero or more occurrences of the character ``c''.

o  .* matches any number of characters except newline.

o  \* matches the character ``*''. (\ escapes any of the above special characters.)

Regular expressions are currently limited to approximately 30 characters, not including meta characters. Regular expressions will generally not cross word boundaries (because only words are stored in the index). So, for example, "lin.*ing" will find ``linking'' or ``flinching,'' but not ``linear programming.''

5.3.3. Query options selected by menus or buttons

The query page may have the following checkboxes, which allow some control over the query specification.

Case insensitive:
    By selecting this checkbox the query will become case insensitive (lower case and upper case letters don't differ). Otherwise, the query will be case sensitive. The default is case insensitive.

Keywords match on word boundaries:
    By selecting this checkbox, keywords will match on word boundaries. Otherwise, a keyword will match part of a word (or phrase). For example, "network" will match ``networking'', "sensitive" will match ``insensitive'', and "Arizona desert" will match ``Arizona desertness''. The default is to match keywords on word boundaries.

Number of errors allowed:
    Glimpse allows the search to contain a number of errors. An error is either a deletion, insertion, or substitution of a single character. The Best Match option will find the match(es) with the fewest errors. The default is 0 (zero) errors.

Note: The previous three options do not apply to attribute names. Attribute names are always case insensitive and allow no errors.

5.3.4. Filtering query results

Harvest allows you to filter the results of a query by any query term, using any attribute defined in the ``List of common SOIF attribute names''. This is done by defining filter parameters in the query form. It is possible to define more than one filter parameter; they will be combined with Boolean AND.

Filter parameters consist of two parts, separated by the pipe symbol ``|''. The first part is a query expression which is attached to the user query using AND before sending the request to the Broker. The optional second part is HTML text that is displayed on the results page, to give the user some information on the applied filter.
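The two-part structure of a filter parameter described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used by search.cgi, and it is not taken from the Harvest source: the function names split_filter and apply_filters are hypothetical, and wrapping each part in parentheses before joining with AND is an assumption made to keep the combined query unambiguous.

```python
def split_filter(value):
    """Split a filter parameter into (query expression, display text).

    The text after the first '|' is optional, so it may come back empty.
    """
    expression, _, label = value.partition("|")
    return expression.strip(), label.strip()


def apply_filters(user_query, filter_values):
    """AND every non-empty filter expression onto the user query.

    Hypothetical helper: the parenthesization is an assumption of this
    sketch, not documented Harvest behavior.
    """
    query = user_query
    labels = []
    for value in filter_values:
        expression, label = split_filter(value)
        if not expression:
            continue  # an empty filter value leaves the query unchanged
        query = f"({query}) AND ({expression})"
        if label:
            labels.append(label)
    return query, labels


if __name__ == "__main__":
    query, labels = apply_filters("windsurfing", ["type: html|HTML documents only"])
    print(query)
    print(labels)
```

An empty filter value contributes nothing, so the user query is sent unchanged, while each non-empty value narrows the result set and supplies the text shown on the results page.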
Example:

    <SELECT NAME="filter">
    <OPTION VALUE=''>No Filter
    <OPTION VALUE='uri: "xyz\.edu"|Search only xyz.edu'>Search xyz.edu only
    <OPTION VALUE='type: html|HTML documents only'>Search HTML documents only
    </SELECT>

The first option returns an unfiltered output. The second option returns only pages with ``xyz.edu'' in their URL. The third option returns only HTML documents.

See the advanced search page of the Broker for more examples.

5.3.5. Result set presentation

The query page may have the following checkboxes, which allow some control over the presentation of the query results.

Display matched lines (from content summaries):
    By selecting this checkbox, the result set presentation will contain the lines of the Content Summary that matched the query. Otherwise, the matched lines will not be displayed. The default is to display the matched lines.

Display object descriptions (if available):
    Some objects have short, one-line descriptions associated with them. By selecting this checkbox, the descriptions will be presented. Otherwise, the object descriptions will not be displayed. The default is to display object descriptions.

Display links to indexed content summary:
    This checkbox allows you to set whether links to the indexed content summaries are displayed or not. The default is not to display links to indexed content summaries.

5.4. Customizing the Broker's Query Result Set

It is possible for the Harvest administrator to customize how the Broker query result set is generated, by modifying a configuration file that is interpreted by the search.cgi Perl program at query result time.

search.cgi allows you to customize almost every aspect of its HTML output. The file $HARVEST_HOME/cgi-bin/lib/search.cf contains the default output d