RE: DigitalConvergence.com: Look out mouse, here comes the cat...


From: Eugene Leitl (eugene.leitl@lrz.uni-muenchen.de)
Date: Mon Sep 18 2000 - 01:39:45 PDT


This is very stale, but I have to reboot my machine, so no mail in the
queue will remain unanswered ;)

Gavin Thomas Nicol writes:
> > Web server? All you need is a mechanism which adds a URI plus
> > fulltext index of the text fields (don't get me started on semantic
> > web) of the inserted object at each insert (inverse operation at
> > delete). If it's possible, you could just iterate over all objects in
> > a database which are allowed to be indexed. Then place the index in a
> > standard location, and (maybe) notify the web spider.
>
> I'm not quite sure I understand what you're proposing. If you
> have a single document... say the Boeing 747 manuals, and
> provide a means to (at runtime) convert that into pages, TOCs
> etc. and have all that done using a number of possible stylesheets,
> it's basically impossible to generate all the links for a robot
> to consume. It gets worse when you take the browser into
> account, or user preferences.
 
No, no, no. You want to present a document to the world. Taking your
Boeing manual example, we have pages, a TOC, a list of figures, and
the like. If the thing is too large to be indexed as a whole
(certainly so in your example), you can indeed fragment it into
smaller pieces (say, pages) and index them independently. You can
pull up every page with a unique query; that query URI is the address
of the individual atomic document. Each page has some text, which
gets full-text indexed. You leave the up-to-date full-text index in a
specific location and notify the web crawler to pick it up. No
robots.txt is necessary, because the spider doesn't have to index
anything (and pelt the poor web server with redundant queries); it
just picks up the prefabricated index, which is guaranteed to be up
to date (no pointless polling to see whether the document tree has
changed somewhere). The result is much less load on the web server
(ideally, a single query instead of millions) and on the network
(these indexes are darn compact).
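
To make this concrete, here's a rough sketch of the server-side part
(Python; the example.com URIs and the fulltext-index.json location
are my own inventions, since no standard location for such indexes
exists):

    import json
    from collections import defaultdict

    def build_index(pages):
        """Build an inverted index: each term maps to the URIs of
        the pages that contain it."""
        index = defaultdict(set)
        for uri, text in pages.items():
            for term in set(text.lower().split()):
                index[term].add(uri)
        return {term: sorted(uris) for term, uris in index.items()}

    # Hypothetical atomic fragments of the manual, each reachable
    # through its own query URI.
    pages = {
        "http://example.com/manual?doc=747&page=1":
            "engine start procedure checklist",
        "http://example.com/manual?doc=747&page=2":
            "hydraulic system overview and checklist",
    }

    # Leave the prefabricated index at a well-known location for the
    # spider to fetch in a single request.
    with open("fulltext-index.json", "w") as f:
        json.dump(build_index(pages), f, indent=2)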

You seem to imply there's some combinatorial explosion lurking
somewhere, but I fail to see where.

> About the best you can do is provide a link to your *own*
> fulltext search engine *or* dump the vocabulary list to the

You can do that, if you have a distributed search engine
standard. Btw, it might make sense to index the 100-1000 or so sites
in your next-hop neighbourhood, since that way you don't have to
amplify your query to reach each individual node, which would be
rather expensive.
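
Roughly, each node would fold its neighbours' prefabricated indexes
into one local index, so a single lookup answers for the whole
neighbourhood (again just a sketch; merge_indexes and the data shapes
are invented here):

    def merge_indexes(neighbour_indexes):
        """Fold the prefabricated indexes of nearby sites into one,
        so a term lookup covers the whole neighbourhood in one hop."""
        merged = {}
        for index in neighbour_indexes:
            for term, uris in index.items():
                merged.setdefault(term, set()).update(uris)
        return merged

    def search(merged, term):
        return sorted(merged.get(term.lower(), ()))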

> crawler, with a URL pointing to the root (or perhaps root
> with query applied).
 
A typical URI does include the database query needed to reach your
atomic document. I see no problem with associating a full-text index
of the document (plus META keywords for pictures, and the like) with
that URI.
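
Schematically, an index entry is nothing more than the query-bearing
URI tied to its terms and keywords (field names made up for
illustration):

    entry = {
        "uri": "http://example.com/manual?doc=747&page=42",
        "terms": ["hydraulic", "pressure", "gauge"],
        "meta_keywords": ["schematic", "figure"],  # e.g. for pictures
    }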

> My guess is you were proposing the latter?

I'm not sure what you mean; I'm a bit confused.


