lee@arco.de
I am still catching up on the work that piled up during my absence in Berlin, so here are my belated conclusions from SINN 02.
My initial plan was to start on the 1.9 versions of Harvest right after releasing Harvest 1.8, but after receiving a bug report (Sutapa) and code (Michael), I want to delay 1.9 by a week or two.
It would be nice if you could mail me any wishes, bug reports and code modifications you have made, so that I can merge them into the current stable tree; the modified HTML summarizer (Thomas?) and the modified ranking (Thomas again) come to mind. When this is finished, I will dedicate my efforts to the 1.9 version. From then on, the 1.8 tree will only get critical bug fixes and new translations of the user interface as I receive them.
Having seen that the user base of Harvest is larger than I expected and that there is still interest in its further development, I think there is potential to speed that development up.
It would be nice if we could coordinate our development efforts, so that features are not implemented twice and problems are solved "right": a fix confined to its own limited area is often a poor substitute for a more general approach that would have solved that special problem and other potential problems as well.
What I would like to see is code contributions, testing, translations and feature requests. Don't hesitate to contact me even if you think that your proposal may be too simple, too naive, too complicated or too whatever. There have been many cases where I was amazed at how I could have overlooked or forgotten the most obvious solution to a problem.
It would also be very nice if I could raise some funding so that I can dedicate more time to developing Harvest.
Depending on how much support I get, it would also be possible to speed up the development significantly. It might even be possible to put more ambitious tasks on my to-do list, since I could switch from thinking in terms of man-days and man-weeks to man-months and man-years.
I want to thank Thomas for his patience and for his good arguments in convincing me to support XML-Query, even with the backing of the major players of the industry. I felt quite honored by his efforts.
What really made the point was Dr. Heinrich Stamerjohanns' presentation, together with the prospect that the rest of the world might be interested in interacting with Harvest. So I see that it may be useful to switch from SOIF to XML, even though I still consider SOIF superior, at least in Harvest's context: SOIF is simpler, i.e. easier to parse, and less verbose than XML, i.e. it saves bandwidth, disk space and memory.
The only technical reason for XML I heard was that it is more robust in case of data corruption. Data corruption has not been a big issue in Harvest. In any case, we should take care that data doesn't get corrupted in the first place, instead of making expensive checks and still losing the data once we see that it was corrupted.
I will do some testing. If the file size grows by 5-10% when converted from SOIF to XML, I will switch to XML as soon as feasible. If it grows by up to 50%, I will think hard for a few days and then decide to switch to XML anyway. If it grows even more, then I will switch when I upgrade my development machine from some hundred MHz to some GHz beast.
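A minimal sketch of how such a size test could be run, written in Python. It assumes the basic SOIF layout of an "@TYPE { URL" header followed by "Name{size}:<tab>value" attributes and a closing "}". The XML layout (a "record" element with one child element per attribute), the function names and the sample record are purely illustrative assumptions, not a format Harvest defines, and the decision rule simply restates the thresholds above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Simplified SOIF grammar: "@TYPE { URL", then "Name{size}:<TAB>value" lines, then "}".
HEADER = re.compile(rb"@(\w+) \{ (\S+)\n")
ATTR = re.compile(rb"([\w-]+)\{(\d+)\}:\t")


def parse_soif(data: bytes):
    """Parse one SOIF object into (template_type, url, [(name, value), ...])."""
    m = HEADER.match(data)
    if m is None:
        raise ValueError("not a SOIF object")
    template, url = m.group(1).decode(), m.group(2).decode()
    pos, attrs = m.end(), []
    while not data.startswith(b"}", pos):
        a = ATTR.match(data, pos)
        if a is None:
            raise ValueError(f"bad attribute at byte {pos}")
        size = int(a.group(2))  # value length in bytes
        value = data[a.end():a.end() + size]
        attrs.append((a.group(1).decode(), value.decode("utf-8")))
        pos = a.end() + size + 1  # skip the newline after the value
    return template, url, attrs


def to_xml(template, url, attrs) -> str:
    """Hypothetical XML layout: one child element per SOIF attribute."""
    lines = [f"<record type={quoteattr(template)} url={quoteattr(url)}>"]
    for name, value in attrs:
        lines.append(f"  <{name}>{escape(value)}</{name}>")
    lines.append("</record>")
    return "\n".join(lines) + "\n"


def growth_percent(soif: bytes) -> float:
    """Size growth in percent when the SOIF object is rewritten as XML."""
    xml = to_xml(*parse_soif(soif)).encode("utf-8")
    return 100.0 * (len(xml) - len(soif)) / len(soif)


def decision(growth: float) -> str:
    """Restates the thresholds from the text above."""
    if growth <= 10:
        return "switch as soon as feasible"
    if growth <= 50:
        return "think it over, then switch"
    return "switch after the hardware upgrade"


# Made-up sample record, for illustration only.
SAMPLE = (
    b"@FILE { http://www.example.org/index.html\n"
    b"Title{13}:\tExample Title\n"
    b"Keywords{12}:\tsearch index\n"
    b"}\n"
)

if __name__ == "__main__":
    g = growth_percent(SAMPLE)
    print(f"growth: {g:.1f}% -> {decision(g)}")
```

A tiny record like the sample overstates the overhead, because the tags weigh heavily against very short values; a meaningful test would run the same measurement over a real gatherer database.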
Since this is an "up-conversion", i.e. a transformation from a simple format to a more versatile one, it should be easy. The transition will be especially smooth since Indexdata's Zebra already supports XML. I noticed that the data format of OpenOffice is XML (thanks to Thomas for this pointer). The idea of not needing any summarizer is very appealing, although it will probably turn out that we still need one, even if it may be called a normalizer instead. From what I hear, Microsoft plans to switch to XML for the next version of Office, too, so this scheme should be usable there as well, at least until Microsoft starts playing the usual "embrace and extend" game; we can cope with that problem when we actually have it.
All that said, once the documents are in XML format, it may also be useful to support XML-Query.
It is quite possible that a third party will come up with a Z39.50 to XML-Query gateway once XML-Query becomes popular enough. While full compliance with the XML-Query standard may be hard to achieve this way, it should be possible to get a useful implementation of a subset of XML-Query.
If nobody beats me to implementing such a gateway, I will try to do it myself. Getting a working subset shouldn't be very hard, since Z39.50 and XML-Query seem to have similar functionality, at least in the basic query cases. It would be perfect if the Indexdata folks were interested in XML-Query, too.
On the other hand, writing a fulltext engine with a query language more complex than SQL would probably take at least 10 man-years; in other words, it is impossible given the resources I can put into the development of Harvest. So to get more than a gateway, I will have to plug in an external fulltext engine with XML-Query support once one becomes available.
Testing with "real world data" showed me that I had underestimated meta tags. I will evaluate and try to merge the already existing changes to the HTML summarizer (Thomas?) and try to improve the way the summarizers process meta tags.
Perhaps the Math-Net folks also have some usable modifications in this area. I will ask them when I meet them next week.
The language-dependent parts of the user interface are located in:
The first step in creating a localised user interface should be to create your own configuration file and query page, based on the other examples in those directories.
For complete localisation you might also want to translate src/broker/examples/brokers/*.html, which are various help pages.
I have created a mailing list for Harvest and would like to invite everybody interested in Harvest to join it.
Use the following URL to join the mailing list:
https://lists.sourceforge.net/lists/listinfo/harvest-devel
It would be nice if anybody could provide a news gateway to news:comp.infosystems.harvest.
My ultimate goal is to provide a software framework flexible enough to scale from a simple site search to full internet search.
My vision is not technical innovation just for the sake of technical innovation, but a usable product that makes optimal use of the given (human) resources. I will use any available technique to achieve that goal and innovate where necessary.
To put it simply, I want to see "Google switched to Harvest" on the front page of the newspapers. :-)
I want to thank all the people at SINN 02 for feedback and nice ideas. I also want to thank the folks from ISN for giving me an opportunity to attend SINN 02.
Finally, I want to thank everybody for reading this document. :-)