
4.5.2 Summarizing SGML data


Starting with Harvest Version 1.2, it is possible to summarize documents that conform to the Standard Generalized Markup Language (SGML) [12], provided you have a Document Type Definition (DTD) for them. The World-Wide Web's Hypertext Markup Language (HTML) is itself an application of SGML, with a corresponding DTD. (In fact, the Harvest HTML summarizer now uses the HTML DTD and our SGML summarizing mechanism, which provides various advantages; see Section 4.5.2.) SGML is used in an increasingly broad variety of applications, for example as a storage format for data in a number of physical sciences. Because SGML documents carry explicit structure, Harvest can summarize them very effectively.

The SGML summarizer (SGML.sum) uses the sgmls program by James Clark to parse the SGML document. The parser needs both a DTD for the document and a Declaration file that describes the allowed character set. The SGML.sum program then uses a table that maps SGML tags to SOIF attributes.
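
The idea of a tag-to-attribute table can be sketched in a few lines of Python. This is only an illustration of the mapping step, not the SGML.sum implementation: the table contents, the attribute names, the `summarize` function, and the (tag, text) input format are all hypothetical, and the parsing that sgmls performs is assumed to have already happened.

```python
# Hypothetical table mapping SGML tags to SOIF attribute names.
# The entries are illustrative assumptions, not Harvest's actual table.
TAG_TO_ATTRIBUTE = {
    "TITLE": "title",
    "H1": "headings",
    "H2": "headings",
    "A": "keywords",
}


def summarize(elements, table=TAG_TO_ATTRIBUTE):
    """Collect the text of (tag, text) pairs under the mapped attribute name."""
    summary = {}
    for tag, text in elements:
        attribute = table.get(tag.upper())
        if attribute is None:
            continue  # tags absent from the table are ignored in this sketch
        summary.setdefault(attribute, []).append(text.strip())
    # Several tags may feed one attribute; join their text line by line.
    return {name: "\n".join(values) for name, values in summary.items()}


# Stand-in for parser output: (tag, text) pairs in document order.
parsed = [
    ("title", "Harvest Manual"),
    ("h1", "Intro"),
    ("p", "Body text"),
    ("h2", "Setup"),
]
result = summarize(parsed)
```

In this sketch, text from tags that share an attribute name is gathered into one value, and text from unmapped tags such as `p` is left out of the summary.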

Duane Wessels
Wed Jan 31 23:46:21 PST 1996