
## 4.5 Extracting data for indexing: The Essence summarizing subsystem

After the Gatherer retrieves a document, it passes the document through a subsystem called Essence [11,10] to extract indexing information. Essence allows the Gatherer to collect indexing information easily from a wide variety of information, using different techniques depending on the type of data and the needs of the particular corpus being indexed. In a nutshell, Essence can determine the type of data pointed to by a URL (e.g., PostScript vs. HTML), "unravel" presentation nesting formats (such as compressed "tar" files), select which types of data to index (e.g., don't index Audio files), and then apply a type-specific extraction algorithm (called a summarizer) to the data to generate a content summary. Users can customize each of these aspects, but often this is not necessary: Harvest is distributed with a "stock" set of type recognizers, presentation unnesters, candidate selectors, and summarizers that work well for many applications.
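The four stages described above (unnesting, type recognition, candidate selection, and summarizing) can be pictured with a small Python sketch. This is only an illustration of the control flow, not Essence's actual implementation or configuration format: every name in it (`SUFFIX_TYPES`, `SKIPPED_TYPES`, `unnest`, `recognize_type`, `essence_like_summary`, and the toy summarizers) is hypothetical, type recognition is reduced to a file-name suffix lookup, and unnesting handles only a single gzip layer.

```python
import gzip
from typing import Callable, Dict, Optional, Set, Tuple

# Hypothetical type recognizer table: file-name suffix -> type name.
SUFFIX_TYPES: Dict[str, str] = {
    ".ps": "PostScript",
    ".html": "HTML",
    ".txt": "Text",
    ".au": "Audio",
}

# Hypothetical candidate selection: types that are not indexed at all.
SKIPPED_TYPES: Set[str] = {"Audio"}


def unnest(url: str, data: bytes) -> Tuple[str, bytes]:
    """Undo one layer of presentation nesting (assumption: gzip only)."""
    if url.lower().endswith(".gz"):
        return url[:-3], gzip.decompress(data)
    return url, data


def recognize_type(url: str) -> str:
    """Guess the data type from the URL suffix (illustrative only)."""
    lowered = url.lower()
    for suffix, type_name in SUFFIX_TYPES.items():
        if lowered.endswith(suffix):
            return type_name
    return "Unrecognized"


def summarize_first_line(url: str, data: bytes) -> str:
    """Toy summarizer: return the first non-empty line of the data."""
    for line in data.decode("utf-8", errors="replace").splitlines():
        if line.strip():
            return line.strip()
    return ""


def summarize_name_only(url: str, data: bytes) -> str:
    """Toy fallback summarizer: return just the file name."""
    return url.rsplit("/", 1)[-1]


# Type name -> summarizer; types without an entry use the fallback.
SUMMARIZERS: Dict[str, Callable[[str, bytes], str]] = {
    "Text": summarize_first_line,
    "HTML": summarize_first_line,
}


def essence_like_summary(url: str, data: bytes) -> Optional[str]:
    """Run the four stages in order; None means 'not selected for indexing'."""
    url, data = unnest(url, data)              # presentation unnesting
    type_name = recognize_type(url)            # type recognition
    if type_name in SKIPPED_TYPES:             # candidate selection
        return None
    summarizer = SUMMARIZERS.get(type_name, summarize_name_only)
    return summarizer(url, data)               # type-specific summarizer
```

Keeping each stage behind its own table or function mirrors the point made in the text: each aspect can be customized independently while the stock behavior remains the default.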

Starting with Harvest Version 1.2, we are also integrating support for summarizers based on outside "component technologies" of both a free and a commercial nature.

Below we describe the stock summarizer set, the current components distribution, and how users can customize summarizers to change how they operate and add summarizers for new types of data. If you develop a summarizer (or an interface to a commercial system) that is likely to be useful to other users, please notify us via email at harvest-dvl@cs.colorado.edu so we may include it in our components distribution.

| Type | Summarizer Function |
|---|---|
| Audio | Extract file name |
| Bibliographic | Extract author and titles |
| Binary | Extract meaningful strings and manual page summary |
| Dvi | Invoke the Text summarizer on extracted ASCII text |
|  | Extract all words in file |
| Framemaker | Up-convert to SGML and pass through SGML summarizer |
| HTML | Extract anchors, hypertext links, and selected fields (see SGML) |
| LaTex | Parse selected LaTeX fields (author, title, etc.) |
| Makefile | Extract comments and target names |
| ManPage | Extract synopsis, author, title, etc., based on "-man" macros |
| Object | Extract symbol table |
| Patch | Extract patched file names |
| Perl | Extract procedure names and comments |
| PostScript | Extract text in word processor-specific fashion, and pass through Text summarizer |
| RCS, SCCS | Extract revision control summary |
| RTF | Up-convert to SGML and pass through SGML summarizer |
| SGML | Extract fields named in extraction table (see the SGML section) |
| SourceDistribution | and source code files, and summarize any manual pages |
| SymbolicLink | Extract file name, owner, and date created |
| Tex | Invoke the Text summarizer on extracted ASCII text |
| Text | Extract first 100 lines plus first sentence of each remaining paragraph |
| Troff | Extract author, title, etc., based on "-man", "-ms", "-me" macro packages, or extract section headers and topic sentences |
| Unrecognized | Extract file name, owner, and date created |
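
As one concrete illustration of an entry in the table, the Text rule ("first 100 lines plus first sentence of each remaining paragraph") could be approximated as below. This is a hedged sketch rather than the stock summarizer's code: the function name `text_summary` and its `head_lines` parameter are hypothetical, paragraphs are assumed to be separated by blank lines, and a sentence is assumed to end at the first `.`, `!`, or `?` followed by whitespace, so abbreviations such as "e.g." will cut a sentence short.

```python
import re
from typing import List


def text_summary(text: str, head_lines: int = 100) -> str:
    """Keep the first `head_lines` lines verbatim, then the first
    sentence of every paragraph that follows them."""
    lines = text.splitlines()
    head: List[str] = lines[:head_lines]
    rest = "\n".join(lines[head_lines:])

    # Assumption: paragraphs are separated by one or more blank lines.
    # If the cut-off falls mid-paragraph, the leftover part of that
    # paragraph is treated as a paragraph of its own.
    paragraphs = [p for p in re.split(r"\n\s*\n", rest) if p.strip()]

    first_sentences: List[str] = []
    for paragraph in paragraphs:
        flat = " ".join(paragraph.split())  # collapse line breaks
        # Assumed sentence boundary: ., !, or ? followed by space or end.
        match = re.search(r"(.+?[.!?])(?:\s|$)", flat)
        first_sentences.append(match.group(1) if match else flat)

    return "\n".join(head + first_sentences)
```

For example, calling `text_summary(document, head_lines=2)` on a short document keeps its first two lines and then one sentence per later paragraph; the default of 100 matches the figure given in the table.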



Duane Wessels
Wed Jan 31 23:46:21 PST 1996