
Using regular expressions to summarize a format

The first new format is the ``ReferBibliographic'' type, the format that the refer program uses to represent bibliographic information. To recognize files in this format, we'll use the convention that the filename ends in ``.referbib'', and add that naming heuristic as a type recognition customization. Each naming heuristic is a regular expression that is matched against the filename, and is listed in the lib/byname.cf file:

 

        ReferBibliographic      ^.*\.referbib$
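
To see which filenames this expression accepts, the check can be sketched in Python. This is only an illustration and not how Essence itself is implemented: the BYNAME table and the type_by_name function are hypothetical names, and the sketch assumes the expression is tested against the bare filename, as described above.

```python
import re

# Hypothetical in-memory copy of the lib/byname.cf entry shown above:
# type name -> regular expression tested against the filename.
BYNAME = {"ReferBibliographic": r"^.*\.referbib$"}

def type_by_name(filename):
    """Return the first type whose pattern matches filename, else None."""
    for type_name, pattern in BYNAME.items():
        if re.search(pattern, filename):
            return type_name
    return None
```

Under these assumptions, a name such as networks.referbib is recognized as ReferBibliographic, while networks.referbib.bak is not, because the expression is anchored at the end of the name with ``$''.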

Now, to write a summarizer for this type, we'll need a sample ReferBibliographic file:

        %A A. S. Tanenbaum
        %T Computer Networks
        %I Prentice Hall
        %C Englewood Cliffs, NJ
        %D 1988

Essence summarizers extract structured information from files. One way to write a summarizer is to use regular expressions to define the extractions. For each kind of information that you want to extract from a file, add to lib/quick-sum.cf a regular expression that matches the lines containing it. For example, the following regular expressions in lib/quick-sum.cf will extract the author, title, date, and other information from ReferBibliographic files:

        ReferBibliographic      Author                  ^%A[ \t]+.*$
        ReferBibliographic      City                    ^%C[ \t]+.*$
        ReferBibliographic      Date                    ^%D[ \t]+.*$
        ReferBibliographic      Editor                  ^%E[ \t]+.*$
        ReferBibliographic      Comments                ^%H[ \t]+.*$
        ReferBibliographic      Issuer                  ^%I[ \t]+.*$
        ReferBibliographic      Journal                 ^%J[ \t]+.*$
        ReferBibliographic      Keywords                ^%K[ \t]+.*$
        ReferBibliographic      Label                   ^%L[ \t]+.*$
        ReferBibliographic      Number                  ^%N[ \t]+.*$
        ReferBibliographic      Comments                ^%O[ \t]+.*$
        ReferBibliographic      Page-Number             ^%P[ \t]+.*$
        ReferBibliographic      Unpublished-Info        ^%R[ \t]+.*$
        ReferBibliographic      Series-Title            ^%S[ \t]+.*$
        ReferBibliographic      Title                   ^%T[ \t]+.*$
        ReferBibliographic      Volume                  ^%V[ \t]+.*$
        ReferBibliographic      Abstract                ^%X[ \t]+.*$

The first field of each entry in lib/quick-sum.cf is the name of the type. The second field is the Attribute under which the extracted information is stored. The third field is the regular expression: the information is extracted from the lines that match it.
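
The effect of these entries on the sample file can be sketched in Python. This is a rough illustration under stated assumptions rather than the Essence implementation: RULES, SAMPLE, and quick_sum are hypothetical names, only a few of the entries are copied, and the sketch stores each matching line whole, since this section does not say whether the leading tag (such as ``%A'') is removed from the extracted value.

```python
import re

# A few of the ReferBibliographic entries from lib/quick-sum.cf,
# as (attribute, regular expression) pairs.
RULES = [
    ("Author", r"^%A[ \t]+.*$"),
    ("City", r"^%C[ \t]+.*$"),
    ("Date", r"^%D[ \t]+.*$"),
    ("Issuer", r"^%I[ \t]+.*$"),
    ("Title", r"^%T[ \t]+.*$"),
]

# The sample ReferBibliographic file shown earlier in this section.
SAMPLE = """\
%A A. S. Tanenbaum
%T Computer Networks
%I Prentice Hall
%C Englewood Cliffs, NJ
%D 1988
"""

def quick_sum(text, rules):
    """Collect, per attribute, every line matching that attribute's pattern."""
    summary = {}
    for line in text.splitlines():
        for attribute, pattern in rules:
            if re.search(pattern, line):
                # Assumption: keep the whole matching line, tag included.
                summary.setdefault(attribute, []).append(line)
    return summary

summary = quick_sum(SAMPLE, RULES)
```

Each of the five lines in the sample matches exactly one of the copied expressions, so summary ends up with one entry per attribute. An attribute such as Editor would not appear, because the sample has no ``%E'' line.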

 


Duane Wessels
Wed Jan 31 23:46:21 PST 1996