Harvest is a system for collecting information and making it searchable through a web interface. Harvest can collect information from the Internet and from intranets using HTTP, FTP and NNTP, as well as from local files such as data on hard disks, CD-ROMs and file servers. In addition to HTML, the currently supported formats include TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect, RTF, Microsoft Word/Excel, SGML, C sources and many more. Stubs for PDF support are included in Harvest and will use Xpdf or Acroread to process PDF files. Adding support for a new format is easy thanks to Harvest's modular design.
See the Harvest homepage at http://harvest.sourceforge.net/ for more information about Harvest.
Harvest is available for download from the Harvest download page at http://prdownloads.sourceforge.net/harvest/.
Andrei Malashevich has translated the Harvest User's Manual into Russian. It is available on his Harvest User's Manual page at http://baby.chg.ru/manual_harvest/.
Harvest-ng is a reimplementation of Harvest's gatherer by Simon Wilkinson. More information about Harvest-ng is available on the Harvest-ng homepage at http://webharvest.sourceforge.net/ng/.
The core of Harvest, located in the src directory, is licensed under the GPL. Additional components, located in the components directory, are under the GPL or a similar license.
Harvest should run on any Unix-like platform, including FreeBSD, Linux and Solaris.
Michael Schlenker has ported Harvest to Windows platforms using Cygwin (http://sources.redhat.com/cygwin/).
A 120 MHz Pentium with 64 MB RAM should achieve reasonable performance for around 350 MB of full-text data in about 20,000 objects. A 650 MHz Pentium with 256 MB RAM should be able to handle around 1.5 GB of full-text data in about 100,000 objects.
After the original authors ceased working on Harvest, there were periods during which Harvest was unmaintained. During this time the following forked versions of Harvest appeared:
All these forked trees were merged into Harvest 1.6.
For the initial setup, you must be able to modify the web server configuration and to schedule cron jobs. After the initial setup, it is recommended to run Harvest as a different user for security reasons.
Put lines like these in your robots.txt:
User-agent: Harvest
Disallow: /
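The effect of such a rule can be checked before deploying it. The following is a minimal sketch using Python's standard `urllib.robotparser` module. It only illustrates how a parser that follows the robots exclusion convention reads the rule; it is not part of Harvest. The helper name `is_allowed`, the example.com URL, and the agent strings `Harvest/1.9` and `SomeOtherBot` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from the text above, one directive per line.
ROBOTS_TXT = """\
User-agent: Harvest
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; not a Harvest API.
    """
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/docs/index.html"  # placeholder URL
    for agent in ("Harvest", "Harvest/1.9", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

With this rule, agents identifying themselves as Harvest are excluded from the whole site, while agents that match no `User-agent` entry remain unaffected.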
There are many ways to help improve Harvest, depending on your skills and the time you want to contribute: