Planned changes and features on the way to the next stable release of
Harvest
What won't change:
==================
Harvest's basic design
- Gatherer -> Summarizer -> SOIF -> Broker -> CGI
- Gatherd -> Broker, Broker -> Broker
- Some improvements might be necessary, but the basic design is
still ok
SOIF for exchanging data between Harvest's components
- No need for switching to XML, but it may become useful in future
since there has been much work in this area recently
Development model
- Use available components from other sources
  - Save time
  - Will give standard conformance without any effort to follow the
  standards
- Split up big changes into many small changes
- Document all changes
- Frequent and early releases
  - It is easy to follow the development
  - Backup
General Goals:
==============
Increase search speed
More focus on HTTP and HTML
Internationalization
- User interface
- support for non-latin characters for the fulltext engine
Improve Scalability
- for small systems by reducing some overhead for sites using
Harvest as a simple search system
- for large and distributed systems
Increase availability
- minimize downtimes during Harvest system maintenance
  - data expire, data import, index update
Improve access control
- Access control for gatherer and broker
- Access controlled by Load/User/IP Address/Time
Integrate other search systems into Harvest system
- Write Gatherd for other search systems like htdig, aspseek and
mnogosearch
Remove all non GPLed components from Harvest
Remove Glimpse
- Importing from Brokers using filters needs to be reimplemented
- swish may work, better go with Zebra
Improve Ranking
- provide hooks for external ranking algorithms
Promote Harvest to attract more users and developers
Gatherer:
=========
Shift focus to HTTP
Improve gathering over slow connections
Improve http gatherer
- how to deal with dynamic content which creates different URLs.
- how to deal with sites blocking based on browser version
- how to deal with duplicate pages created by mirrors
Create multiple gatherers on the fly if possible
Evaluate larbin and curl
Migrate from GDBM to Sleepycat Berkeley DB for cache management
Remove local disc cache
Implement candidate selection filter for httpenum based on mime type
Trust mime type sent by http servers
Add HTTPS support
Evaluate improvements of HTTP/1.1 over HTTP/1.0
Remove unnesters and use exploders for nested objects
Improve object storage system for more scalability
- Migrate to Sleepycat db as gatherer storage
- Store Gatherer objects as plain files
Improve expiring obsolete entries
- instead of setting an arbitrary ttl, actively expire pages by
checking with their source
Add expire daemon?
Split file: and news: URLs into leafnodes
Remove All-Templates?
Make SOIF objects sharable between Gatherer and Broker if possible.
(Gatherer and Broker running on same machine)
Summarizer:
===========
Shift focus to HTML
Improve HTML summarizer.
- HTML summarizer written in C
- HTML summarizer written in Perl
- SGML based HTML summarizer
HTML summarizer should know more about HTML, like adding a newline if it sees a
or
Improve support for Microsoft Office documents.
Broker:
=======
Add Index Data's Zebra as fulltext indexer
- Z39.50 support
- Incremental indexing
- Most features supported by Glimpse, NEAR, builtin ranking
- stage 1: as external indexer with no or minimal integration into
Harvest
- stage 2: Full integration as replacement for Glimpse
Implement a method to retrieve SOIF object by URI in Broker
- instead of via filesystem or http for DisplaySOIF. displaySOIF
will work even if it runs on a remote machine without an http daemon
Write SOIF filter for Namazu
Improve handling of $HARVEST_HOME/tmp, which is created by
CreateBroker and used by search.cgi. glimpseindex will clean up the
directory. This is a somewhat messy procedure. Make it clearer by
reorganizing search.cgi and CreateBroker.
Store SOIF objects as single files with the md5sum of URL or the
content of the document as filename.
This would allow using a directory depth suitable for storing the
desired number of objects (see squid).
Using md5sum may make collision handling necessary.
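The md5-based layout described above could look roughly like the
following Python sketch. The function name soif_object_path, the
parameters depth and width, and the "objects" root directory are
assumptions made for illustration only; nothing here is taken from the
Harvest sources, and collision handling is not shown.

```python
import hashlib
from pathlib import Path


def soif_object_path(root, url, depth=2, width=2):
    """Map a URL to root/<aa>/<bb>/<md5hex> (hypothetical layout).

    The first `depth` groups of `width` hex digits of the MD5 digest
    are used as nested directory names, so objects spread evenly
    over the directory tree.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return Path(root).joinpath(*parts, digest)


if __name__ == "__main__":
    # Example URL only; prints the path the object would be stored at.
    print(soif_object_path("objects", "http://www.example.com/index.html"))
```

With depth=2 and width=2 this gives 256 * 256 directories; a larger
depth would suit a larger number of objects.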
Extend the "shell indexer" functionality used in glimpseindex
Implement a user interface in PHP
Separate data and metadata for objects.
- it's possible to determine if an object has changed by using
stat() instead of open(), read()
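A minimal sketch of the stat()-based change check, assuming that
objects are stored as plain files and that the previously seen
modification time and size are kept somewhere as metadata. The
function names stat_signature and has_changed are hypothetical.

```python
import os


def stat_signature(path):
    """Return (mtime in ns, size) of a file without reading its content."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def has_changed(path, old_signature):
    """True if the file's mtime or size differs from the stored signature."""
    return stat_signature(path) != old_signature
```

This only detects changes that touch the modification time or the
size; a content checksum would still be needed to be certain.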
Minimize size of Registry
- Remove some fields like OID, Desc
- Store some fields as "tables": Gatherer Host, Name, Version with
an identifier pointing to the entry.
- Don't keep Registry in memory to lower memory consumption and
for faster start.
- Run expire as a separate daemon
- what to keep in Registry except URL, Update-Time and Refreshrate?
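The idea of storing Gatherer Host, Name and Version as a "table" with
an identifier could be sketched as below. The class GathererTable and
its methods are assumptions for illustration and do not reflect the
current Registry code; each Registry entry would then hold only the
small integer identifier.

```python
class GathererTable:
    """Store each (host, name, version) triple once and hand out an id."""

    def __init__(self):
        self._ids = {}    # (host, name, version) -> identifier
        self._rows = []   # identifier -> (host, name, version)

    def intern(self, host, name, version):
        """Return the identifier for this triple, adding it if new."""
        key = (host, name, version)
        if key not in self._ids:
            self._ids[key] = len(self._rows)
            self._rows.append(key)
        return self._ids[key]

    def lookup(self, ident):
        """Return the (host, name, version) triple for an identifier."""
        return self._rows[ident]
```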
Use cookies to save user preferences.
Evaluate Namazu
Evaluate Postgresql, MySQL
Evaluate XQuery and SOAP if the userbase of these standards becomes
big enough
Documentation:
==============
Switch from linuxdoc to docbook for manual and FAQ.
Problems:
=========
PS and PDF summarizers don't work well and there is no easy way to
improve this situation
Multiviews in Apache causes problems when gathering
IMS gathering is problematic/faulty, possibly due to
normalization/escaping problems
Stemming and Soundex are language dependent. It is a problem to
recognize the language used by the document
No usable Thesauri available