Harvest is a system for collecting information and making it searchable through a web interface. Harvest can collect information from the Internet and intranets using HTTP, FTP, and NNTP, as well as from local files such as data on hard disks, CD-ROMs, and file servers. In addition to HTML, the supported formats currently include TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft Word/Excel, SGML, C sources, and many more. Stubs for PDF support are included in Harvest and use Xpdf or Acroread to process PDF files. Adding support for a new format is easy thanks to Harvest's modular design.
Harvest is a modular, distributed search system framework with a working set of components that make it a complete search system. The default setup is a web search engine, but Harvest can do much more and provides the following features:
Harvest can be used to provide a search system for websites. This is the most commonly used setup for Harvest. While it works well for many sites, you may also want to take a look at the more web-centric systems listed in the links on the Harvest homepage.
Harvest provides a wide range of methods to process (transform, delete) the gathered objects depending on file name, file content, and retrieval protocol. Using these features, you can build a search system for special purposes. The Harvest distribution includes examples for building an RFC search system, as well as mail, news, and other search systems.
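As a rough illustration of this kind of rule-based processing, the following Python sketch selects gathered objects by file name, content, and protocol, and then transforms or deletes them. It is not Harvest's configuration syntax or code: the names GatheredObject, Rule, process, delete, strip_headers, and RULES are all hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Optional
from urllib.parse import urlparse


@dataclass
class GatheredObject:
    """Hypothetical stand-in for one gathered object."""
    url: str
    content: str


@dataclass
class Rule:
    """Hypothetical rule: an action plus optional matching criteria.

    A criterion left as None matches every object.
    The action returns the transformed object, or None to delete it.
    """
    action: Callable[[GatheredObject], Optional[GatheredObject]]
    name_glob: Optional[str] = None      # matched against the file name
    content_regex: Optional[str] = None  # searched for in the content
    protocol: Optional[str] = None       # URL scheme, e.g. "http" or "ftp"

    def matches(self, obj: GatheredObject) -> bool:
        parsed = urlparse(obj.url)
        file_name = parsed.path.rsplit("/", 1)[-1]
        if self.name_glob is not None and not fnmatchcase(file_name, self.name_glob):
            return False
        if self.content_regex is not None and not re.search(self.content_regex, obj.content):
            return False
        if self.protocol is not None and parsed.scheme != self.protocol:
            return False
        return True


def delete(obj: GatheredObject) -> Optional[GatheredObject]:
    """Drop the object entirely."""
    return None


def strip_headers(obj: GatheredObject) -> Optional[GatheredObject]:
    """Keep only the text after the first blank line (the message body)."""
    _, _, body = obj.content.partition("\n\n")
    return GatheredObject(obj.url, body)


def process(objects: List[GatheredObject], rules: List[Rule]) -> List[GatheredObject]:
    """Apply every matching rule in order; stop as soon as an object is deleted."""
    kept = []
    for obj in objects:
        current: Optional[GatheredObject] = obj
        for rule in rules:
            if rule.matches(current):
                current = rule.action(current)
                if current is None:
                    break
        if current is not None:
            kept.append(current)
    return kept


# Example rule set (assumed for illustration only).
RULES = [
    Rule(action=delete, name_glob="*.tmp"),                   # by file name
    Rule(action=strip_headers, protocol="nntp"),              # by protocol
    Rule(action=delete, content_regex=r"(?i)do not index"),   # by content
]
```

Calling process(objects, RULES) on a list of GatheredObject values returns only the objects that survive the rules, with news articles reduced to their bodies. Rules are applied in order, so a later rule sees the output of an earlier transformation.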
Harvest is designed as a distributed search system in which machines work together to handle a load that a single machine could not handle. Harvest can also be used to save bandwidth by deploying gatherers near the data source and exchanging the summarized data, which is usually much smaller than the original data.
Harvest can be used for experiments with search components. Harvest's modularity allows you to replace individual components. If you are interested in crawlers, for example, you can use the rest of Harvest to build a test system without having to build a complete search system. If you are developing full-text engines, you can use Harvest to gather and prepare data.
The core of Harvest is licensed under the GPL. The components distributed with Harvest are also under the GPL or a similar license.
Link to the query page of the Harvest demo site. There are also query pages of some sites using Harvest. You can test Harvest before downloading and installing it on your site.
Download information and links to release notes.
Code and documents provided by Harvest users.
List of open tasks.
Collection of Harvest- and search-related links.
List of people who contributed to Harvest.
The Harvest User's Manual provides information about the stable versions of Harvest. It has not yet been updated for the current development versions. For now, please refer to NEWS for an overview of changes and ChangeLog for a detailed list of changes in Harvest.
A list of Frequently Asked Questions with answers.
Harvest quick installation instructions.
List of changes in Harvest sources.
List of news in Harvest sources.