From: Kragen Sitaker (firstname.lastname@example.org)
Date: Fri Jun 30 2000 - 11:05:19 PDT
Sandor Spruit writes:
> What I was wondering while I read this: why don't we have a whole
> bunch of distributed search engines yet ? Isn't it *weird* that an
> activity that lends itself so well for both distributed and parallel
> processing is still almost exclusively handled by some handful of
> commercial companies ?
I think the answer is that doing it efficiently in a centralized
fashion is fairly easy, but doing it efficiently in a decentralized
fashion is really hard.
> Wouldn't searching be much easier if there would be, say, an Apache
> indexing module that could dynamically query other known servers in
> its environment ? Some search engines seem to work surprisingly well
> on huge numbers of pages. What if dozens of people would apply the
> same technology on their own servers and have them communicate ?
Well, there are basically two load issues: huge numbers of pages to
index, and huge numbers of queries. No single web server can handle
more than a tiny fraction of all the pages, and no single web server
can handle more than a tiny fraction of all the queries.
So if you direct an entire query to a single web server, it won't know
about more than a small fraction of the pages on the Web, so it will
have only a small probability of returning any particular page among
the ones you're looking for.
> I know or at least suspect from what I've read that this is basically
> part of what Gnutella - or Freenet for that matter - is all about.
Yes. Gnutella sends a copy of your query to everyone within some
number of hops. This leads to some scaling problems: you can get all
the hits (at least within your search radius) but every server has to
handle a huge number of queries.
Freenet has a more scalable design, but it tends to suffer from not
being able to find things, or so I've heard.
Simple queries are fairly easy to handle; although I don't think anyone
has done this yet, I published the idea in 1997:
You split up the pages to be indexed among the participating nodes in
some fashion; if the participating nodes are web servers, then the
obvious choice is to have each web server index its own pages.
You split up the *terms* to be indexed among the participating nodes in
some fashion, too. A list of pages containing each particular term is
maintained on each of five or ten nodes. Assignment of terms to nodes
can be carried out, for example, by assigning serial numbers to nodes
when they join the network, and then using the first few bits or
characters of the search term to select a range of serial numbers.
Consistent hashing is another, possibly better, approach.
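The consistent-hashing variant can be sketched like this; the class and node names are illustrative, not part of the original proposal. Each node owns several points on a hash ring, and a term is served by the first few distinct nodes found clockwise from the term's own hash, giving the "each of five or ten nodes" replication described above.

```python
import hashlib
from bisect import bisect

def h(s: str) -> int:
    """Stable hash of a string into a large integer space."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class ConsistentRing:
    """Assign each term to `replicas` distinct nodes on a hash ring."""
    def __init__(self, nodes, vnodes=4, replicas=3):
        self.replicas = replicas
        # each node gets several virtual points, smoothing the load
        self.ring = sorted(
            (h(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes))

    def nodes_for(self, term):
        # walk clockwise from the term's hash, collecting distinct owners
        start = bisect(self.ring, (h(term),))
        owners, i = [], start
        while len(owners) < self.replicas:
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

ring = ConsistentRing(["node%d" % n for n in range(10)])
print(ring.nodes_for("autism"))  # three distinct node names
```

A nice property of this scheme is that when a node joins or leaves, only the terms adjacent to its ring points move, rather than everything being reshuffled.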
When you index a page, you must contact every node responsible for a
term the page contains and ask that node to add that page to the list
of pages for that term.
You split up the searches among the participating nodes, too. This can
be done randomly, or by source IP address, or whatever.
When you search for a term, you just contact one or more of the nodes
responsible for the term and ask them for copies of their lists.
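The whole index-and-search protocol above can be sketched in-process; the names (`Node`, `index_page`, `search`) and the byte-sum term-to-node map are stand-ins for whatever assignment scheme the network actually uses.

```python
from collections import defaultdict

class Node:
    """One participating server's slice of the term index."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of page URLs

    def add(self, term, url):
        self.postings[term].add(url)

    def lookup(self, term):
        return set(self.postings.get(term, set()))

NODES = [Node() for _ in range(8)]

def owner(term):
    # stand-in for "first few bits of the term" or consistent hashing:
    # any stable map from term to node works for this sketch
    return NODES[sum(term.encode()) % len(NODES)]

def index_page(url, text):
    # a web server indexing its own page contacts each term's owner
    for term in set(text.lower().split()):
        owner(term).add(term, url)

def search(term):
    # a search contacts a node responsible for the term
    return owner(term).lookup(term)

index_page("http://a.example/", "autism eye contact criteria")
index_page("http://b.example/", "criteria for something else")
print(sorted(search("criteria")))
```

In the real system `owner()` would return a remote address to contact rather than a local object, but the division of labor is the same.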
You could rank the pages in the lists by relevancy, using TF/IDF or
something similar, which would make this system usable for single-term
queries.
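A minimal TF/IDF scoring function, for concreteness; the post doesn't fix a particular formula, so this uses the classic log-scaled variant as an assumption.

```python
import math

def tfidf(term_count_in_doc, doc_length, num_docs, docs_with_term):
    """Score a document for a term: frequent in the document,
    rare across the collection => high score."""
    tf = term_count_in_doc / doc_length
    idf = math.log(num_docs / docs_with_term)
    return tf * idf
```

Each node owning a term already knows `docs_with_term` for that term, so it could keep its posting list sorted by score and return only the top entries.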
This system could be made fairly impervious to subversion by the
distributed participants: since nodes don't get to choose which terms
they index --- the assignment can be derived by consistent hashing of
their IP addresses, for instance --- any particular term is likely to be
owned by at least one node with no interest in spamming it. Thus,
attempts at spamming can be detected and ignored by the nodes doing the
indexing.
It also scales fairly well: if the number of searches and the number of
pages remain proportional to the number of nodes, then the number of
searches each node must handle remains constant, and the number of
pages each node must index remains constant. The only growing load comes
from maintaining the lists of pages for each term, due to the uneven
distribution of term occurrences, and I think even that can be
distributed out fairly evenly.
The big problem that made me not bother to implement it:
How do we do compound queries? As an example, the query `+criteria +autism
+eye-contact' on AltaVista finds 10,000 occurrences of ``eye
contact'', 31,420 occurrences of ``autism'', and 602,214 occurrences of
``criteria''. In this proposed distributed scheme, where the servers that
know where to find ``autism'' are not likely to be the same servers that
know where to find ``criteria'', how do we find the intersection of the
two? It is impractical to download 602,214 URLs -- even compressed -- and
I don't know how else to solve this, let alone rank them in order of
relevance. Yet, on AltaVista, the query only returns 32 documents, of
which the tenth was the one I wanted.
(Those numbers are from November 1997; the numbers now are 165,745,
406,444, 5,149,346, and 426, larger than the 1997 figures by factors of
17, 13, 9, and 13, respectively. Amusingly, bigdb.html itself is now hit
#2. Unfortunately, I can no longer find the information I wanted.)
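The arithmetic behind "impractical" is worth making concrete. The intersection itself is cheap once both lists sit on one machine; the expensive part is shipping one list to the other list's owner. The figures below are the 1997 AltaVista counts quoted above; the 40-byte average URL size is my assumption.

```python
# Posting-list sizes for the example query, from the 1997 counts:
criteria_count = 602214   # URLs on the nodes owning "criteria"
autism_count = 31420      # URLs on the nodes owning "autism"
bytes_per_url = 40        # rough average URL length, an assumption

# Even shipping only the *smaller* list across the network costs
# over a megabyte per query, before compression:
transfer = autism_count * bytes_per_url
print(transfer)  # bytes moved for one compound query
```

And that is a cheap query by AltaVista's standards; a query whose rarest term has millions of hits would be far worse, which is why the compound-query problem sank the design for me.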
--
<email@example.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either. :)
This archive was generated by hypermail 2b29 : Fri Jun 30 2000 - 11:13:58 PDT