Customizing the candidate selection step


The lib/ configuration file contains a list of types that are rejected by Essence. You can add types to or delete types from lib/ to control the candidate selection step.

To direct Essence to index only certain types, list the types to index in lib/ and then supply Essence with the --allowlist flag.

The file and URL naming heuristics used by the type recognition step (described in Section 4.5.4) are particularly useful for candidate selection when gathering remote data. They allow the Gatherer to avoid retrieving files that you don't want to index (in contrast, recognizing types by locating identifying data within a file requires that the file be retrieved first). This approach can save quite a bit of network traffic, particularly when used in combination with enumerated RootNode URLs. For example, many sites provide each of their files in both a compressed and uncompressed form. By building a lib/ containing only the Compressed types, you can avoid retrieving the uncompressed versions of the files.
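The effect of name-based candidate selection can be illustrated with a minimal Python sketch. This is not Essence's implementation: the rule table, the function names guess_type and select_candidates, and the example URLs are all hypothetical, chosen only to show how a type guessed from a URL's name can be checked against a list of rejected or allowed types before anything is retrieved.

```python
import re

# Hypothetical name-based type rules: (regex on the URL, type name).
# These patterns are illustrative assumptions, not Essence's actual heuristics.
TYPE_RULES = [
    (re.compile(r"\.(Z|gz)$"), "Compressed"),
    (re.compile(r"\.ps$"), "PostScript"),
    (re.compile(r"\.(txt|text)$"), "Text"),
]


def guess_type(url):
    """Guess a type from the URL's name alone, without retrieving the file."""
    for pattern, type_name in TYPE_RULES:
        if pattern.search(url):
            return type_name
    return "Unknown"


def select_candidates(urls, types, allow=False):
    """Return the URLs that survive candidate selection.

    With allow=False, `types` acts as a list of rejected types.
    With allow=True, `types` acts as a list of the only types to keep.
    """
    selected = []
    for url in urls:
        listed = guess_type(url) in types
        # Keep unlisted URLs in reject mode and listed URLs in allow mode.
        if listed == allow:
            selected.append(url)
    return selected


# Hypothetical enumerated URLs for one site.
urls = [
    "ftp://ftp.example.com/pub/paper.ps",
    "ftp://ftp.example.com/pub/paper.ps.Z",
    "ftp://ftp.example.com/pub/README.txt",
]

# Allow only Compressed types, so the uncompressed copies are never fetched.
to_fetch = select_candidates(urls, {"Compressed"}, allow=True)
print(to_fetch)
```

In this sketch only the URL ending in .Z is selected, so the uncompressed copies are skipped without any network transfer. Passing allow=False with the same set would instead reject the compressed file and keep the other two.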

Duane Wessels
Wed Jan 31 23:46:21 PST 1996