<!doctype linuxdoc system>

<article>

<title>Conclusions after SINN 02

<author>Kang-Jin Lee <tt/lee@arco.de/

<date>2002-11-17

<abstract>
Personal conclusions for the future development of Harvest after
getting feedback from folks I met at <htmlurl
url="http://isn-oldenburg.de/projects/SINN/sinn02/"
name="SINN 02">.

<toc>

<sect>Recovering from SINN 02

<p>
I am still catching up on work that piled up during my absence in
Berlin, so here are my belated conclusions from SINN 02.

<sect>Development Plan

<p>
My initial plan was to start the 1.9 versions of Harvest right after
releasing Harvest 1.8, but after getting a bug report (Sutapa) and
code (Michael), I want to delay 1.9 for a week or two.

It would be nice if you could mail me any wishes, bug reports and code
modifications you have made, so I can merge the changes into the
current stable tree. The modified HTML summarizer (Thomas?) and the
modified ranking (again Thomas) come to mind. When this is finished, I
will dedicate my efforts to the 1.9 version. From then on, the 1.8
tree will only get critical bug fixes and new translations of the
user interface as I receive them.

Having seen that the user base of Harvest is larger than I expected
and that there is still interest in further development of Harvest,
I think there is potential to speed up the development of Harvest.

It would be nice if we could coordinate our development efforts. That
would avoid implementing features twice, and it would let us do
something "right" instead of fixing a problem only in its own limited
area, where a more general approach would have fixed that special
problem and other potential problems as well.

What I would like to see is code contributions, testing, translations
and feature requests. Don't hesitate to contact me even if you think
that your proposal may be too simple, too naive, too complicated or
too whatever. There have been many cases where I was amazed at how I
could have overlooked or forgotten the most obvious solution to a
problem.

It would also be very nice if I could raise some funding to dedicate
more time to developing Harvest.

Depending on how much support I get, it would also be possible to
speed up the development significantly. It might even be possible to
put more ambitious tasks on my to-do list, since I could switch from
thinking in terms of man days and weeks to man months and years.

<sect>XML and XML-Query

<p>
I want to thank Thomas for his patience and for the good arguments he
used to convince me to support XML-Query, even with the backing of the
major players of the industry. I felt quite honored by his efforts.

What really made the point was Dr. Heinrich Stamerjohanns'
presentation, and the realisation that the rest of the world might be
interested in interacting with Harvest. So, I see that it may be
useful to switch from SOIF to XML, even though I still consider SOIF
superior, at least in Harvest's context, because SOIF is simpler,
i.e. easier to parse, and not as verbose as XML, i.e. it saves
bandwidth, disk space and memory.

The only technical reason for XML I heard was that it is more robust
in case of data corruption. Data corruption hasn't been a big issue in
Harvest. In any case, we should try to make sure that data doesn't
get corrupted in the first place instead of making expensive checks
and still losing data because we see that it was corrupted.

I will do some testing. If the file size grows by 5-10% when converted
from SOIF to XML, I will switch to XML as soon as feasible. If it
grows by up to 50%, I will think hard for a few days and finally
decide to switch to XML. If it grows even more, then I will switch
when I upgrade my development machine from some hundred MHz to some
GHz beast.
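
A rough sketch of how such a size test could look, written in Python.
It assumes a simplified reading of SOIF: a single object whose header
line holds the template type and the URL, followed by attributes
written as a name, a byte count in braces, a colon, a tab and the
value. The XML element names "object" and "attribute" are made up for
this illustration and are not a proposed schema.

<tscreen><verb>
```python
import re
import xml.etree.ElementTree as ET

# Simplified SOIF attribute header: NAME{SIZE}: followed by a tab.
ATTR_HEADER = re.compile(rb"([^\s{}]+)\{(\d+)\}:\t")


def parse_soif(data):
    """Parse one SOIF object into (template_type, url, attributes)."""
    head, _, rest = data.partition(b"\n")
    m = re.fullmatch(rb"@(\S+)\s*\{\s*(\S+)\s*", head)
    if m is None:
        raise ValueError("not a SOIF object header")
    template, url = m.group(1).decode(), m.group(2).decode()
    attrs = []
    pos = 0
    while True:
        h = ATTR_HEADER.match(rest, pos)
        if h is None:
            break  # reached the closing brace or the end of input
        size = int(h.group(2))
        # The value is exactly `size` bytes and may contain newlines.
        value = rest[h.end():h.end() + size]
        attrs.append((h.group(1).decode(), value.decode("utf-8")))
        pos = h.end() + size + 1  # skip the newline after the value
    return template, url, attrs


def soif_to_xml(data):
    """Convert one SOIF object to XML bytes (element names are assumptions)."""
    template, url, attrs = parse_soif(data)
    root = ET.Element("object", {"type": template, "url": url})
    for name, value in attrs:
        field = ET.SubElement(root, "attribute", {"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode").encode("utf-8")


def growth_percent(soif, xml):
    """Size growth of the XML form relative to the SOIF form, in percent."""
    return 100.0 * (len(xml) - len(soif)) / len(soif)


# Hand-written sample record, not real Gatherer output.
SAMPLE = (
    b"@FILE { http://www.example.com/\n"
    b"Title{11}:\tHello World\n"
    b"Type{4}:\tHTML\n"
    b"}\n"
)

if __name__ == "__main__":
    xml = soif_to_xml(SAMPLE)
    print("growth: %.1f%%" % growth_percent(SAMPLE, xml))
```
</verb></tscreen>

A single tiny record exaggerates the overhead, so the number that
matters for the decision above would come from running this over a
whole database with realistically sized attribute values.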

Since this is an "up-conversion", i.e. a transformation from a simple
to a more versatile format, it should be easy. The transition will be
especially smooth since Indexdata's Zebra already supports XML. I
noticed that the data format of OpenOffice is XML (thanks to Thomas
for this pointer). The idea of not needing any summarizer is very
appealing, although it will turn out that we still need a summarizer,
even if it may be called a normalizer instead. From what I hear,
Microsoft plans to switch to XML for the next version of Office, too,
so this scheme should also be usable there, at least until Microsoft
starts playing the usual "embrace and extend" game, but we can cope
with that problem when we actually have it.

All that said, once the documents are in XML format, it may also be
useful to support XML-Query.

It is quite possible that a third party will come up with a Z39.50 to
XML-Query gateway once XML-Query becomes popular enough. While full
compliance with the XML-Query standard may be hard to achieve this
way, it should be possible to implement a useful subset of XML-Query.

If nobody beats me to implementing such a gateway, I will try to do it
myself. Getting a working subset shouldn't be very hard, since both
Z39.50 and XML-Query seem to have similar functionality, at least in
the basic query cases. It would be perfect if the Indexdata folks were
interested in XML-Query, too.

On the other hand, writing a fulltext engine with a more complex query
language than SQL will possibly take at least 10 man years. In other
words, it is impossible considering the resources I can put into the
development of Harvest. So to get more than a gateway, I will have to
plug in an external fulltext engine with XML-Query support when one
becomes available.

<sect>Meta Tags

<p>
I had underestimated meta tags, as I found out after testing with
"real world data". I will evaluate and try to merge the already
existing changes to the HTML summarizer (Thomas?) and try to improve
the way the summarizers process meta tags.

Perhaps the Math-Net folks have some usable modifications in this
area, too. I will ask them when I meet them next week.
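
To illustrate what processing meta tags means here, the following is a
minimal sketch in Python that collects the name and content pairs of
meta tags from an HTML page. It is not the Harvest HTML summarizer and
not one of the modified versions mentioned above. The class and
function names are hypothetical, and lowercasing the names and keeping
repeated tags as a list are assumptions made for this example.

<tscreen><verb>
```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from meta tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        content = fields.get("content")
        if name and content:
            # Meta names are matched case-insensitively here; a name
            # that occurs several times keeps all of its values.
            self.meta.setdefault(name.lower(), []).append(content.strip())


def summarize_meta(html_text):
    """Return a dict mapping lowercased meta names to lists of values."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta


# Hand-written sample input, not taken from a real page.
SAMPLE = (
    '<meta name="keywords" content="harvest, search">'
    '<meta name="DC.Creator" content="Example Author">'
)

if __name__ == "__main__":
    print(summarize_meta(SAMPLE))
```
</verb></tscreen>

Each collected pair could then become one attribute of the summary
record, which is where "real world data" with missing, repeated or
oddly cased meta tags needs the extra care.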

<sect>Localisation Instructions

<p>
The language-dependent parts of the user interface are located in:

<itemize>
<item><bf>components/broker/standard/WWW/your_language.cf</bf>
<item><bf>src/broker/example/brokers/skeleton/query-glimpse-modern.html.xx.in</bf>
</itemize>

The first step in creating a localised user interface should be to
create your own configuration file and query page, based on the
existing examples in those directories.

For a complete localisation you might also want to translate
<bf>src/broker/examples/brokers/*.html</bf>, which are various help
pages.

<sect>Mailing List for Harvest

<p>
I have created a mailing list for Harvest and would like to invite
everybody interested in Harvest to join it.

Use the following URL to join the mailing list:

<htmlurl
url="https://lists.sourceforge.net/lists/listinfo/harvest-devel"
name="https://lists.sourceforge.net/lists/listinfo/harvest-devel">

It would be nice if somebody could provide a news gateway to
<url url="news:comp.infosystems.harvest"
name="news:comp.infosystems.harvest">.

<sect>Vision

<p>
My ultimate goal is to provide a software framework flexible enough to
scale from a simple site search to full internet search.

My vision is not technical innovation just for the sake of technical
innovation. It is to deliver a usable product that makes optimal use
of the given (human) resources. I will use any available techniques
to achieve that goal and innovate where necessary.

To put it simply, I want to see "Google switched to Harvest" on the
front page of the newspapers. :-)

<sect>Thanks

<p>
I want to thank all the people at SINN 02 for their feedback and nice
ideas. I also want to thank the folks from ISN for giving me the
opportunity to attend SINN 02.

Finally, I want to thank everybody for reading this document. :-)

</article>
