Google censorship (period)

Kragen Sitaker kragen@pobox.com
Thu, 4 Apr 2002 05:19:19 -0500 (EST)


Eugene Leitl writes on FoRK:
> On Sun, 24 Mar 2002, Kragen Sitaker wrote:
> > Eugene Leitl writes:
> > > The only way to make a search engine uncensorable is to make it P2P...
> > > Just migrate htdig or similar into Apache, and put the full text indices
> > > in a standard location. Plus put a spider to sweep local network
> > > neighbourhood and to pick up these. Sounds like a person-afternoon for a
> > > professional, probably several weeks for more pedestrian types.
> > This design doesn't work.  To search for any unusual term, you must
> > send a search to every local network neighborhood on the planet; and
> > to search for any common term, you must collate results from every
> > local network neighborhood on the planet and rank them somehow.
> 
> My machines run 24 h/day, hard drive space is cheap, and will soon move
> into TByte range/disk. Digital currency based infrastructure would allow
> me to put that resource to good use by sweeping a neighbourhood of
> 10^3..10^4 boxes without having to suffer from freeloaders.

Do you mean because you would pay those 10^3..10^4 boxes for their
localindex.bz2 files?

> While local network neighbourhood allows me to max out the pipe, I
> can (and should) make a fraction of it long-range (since spidering
> is merely making a query for localindex.bz2 and grabbing it if it's
> there, it's cheap). At indexing scale of 10^3 and query fanout factor
> of 3 you are querying the index of a billion nodes

How do you get query fanout?  Are you suggesting that my
localindex.bz2 should contain the localindex.bz2 of everybody I've
pulled from?

Local network neighborhood doesn't give you exponential fanout, since
the folks in your neighborhood are mostly in each other's
neighborhoods.
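
To put rough numbers on that overlap, here is a minimal sketch in
Python.  The ring layout, the function names, and the parameters are
my assumptions for illustration, not anything from the proposal: each
node indexes the k nearest nodes on either side, which is close to the
worst case for overlap.  Real networks should fall somewhere between
this and the ideal tree.

```python
def ideal_reach(neighborhood_size, hops):
    """Nodes reached if no two neighborhoods ever overlapped (a pure tree)."""
    return neighborhood_size ** hops


def ring_reach(n, k, hops):
    """Distinct nodes reached from node 0 on a ring of n nodes.

    Assumption: every node indexes itself plus the k nearest nodes on
    each side, so adjacent nodes share almost all of their neighborhoods.
    """
    reached = {0}
    frontier = {0}
    for _ in range(hops):
        seen_this_hop = set()
        for node in frontier:
            for offset in range(-k, k + 1):
                seen_this_hop.add((node + offset) % n)
        # Only nodes we had not seen before can contribute anything new.
        frontier = seen_this_hop - reached
        reached |= seen_this_hop
    return len(reached)


if __name__ == "__main__":
    # Neighborhood of about 10^3 boxes (500 on each side), three hops.
    print("ideal tree:", ideal_reach(1000, 3))
    print("ring:      ", ring_reach(10**6, 500, 3))
```

Under these assumptions, three hops through 10^3-box neighborhoods
reach about 3*10^3 distinct nodes rather than 10^9.  Making a fraction
of the links long-range, as suggested above, is what would push the
number toward the ideal figure.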

> -- and a very up-to-date index, even without pickup notification,
> which is cheap (each node knows when its content has changed
> significantly).

Dave Winer's XML-RPC push-based search engine interface doesn't seem
to have caught on in the last three years, unfortunately.  Maybe this
would help.

> The query packet is tiny (seems to ask for UDP), and hits will come
> from many nodes randomly distributed from across the world. I
> haven't thought about ranking yet, but I'm sure there's a
> way. Ranking has to evaluate the global pool, so you would have to
> identify the documents with high hyperlink rate, which are prone to
> show up even in a small subset of the global index. Looks like the
> nodes need to talk to each other a bit, to compute a global ranking
> index (which doesn't have to be exhaustive to be useful, and the
> least significant hits will be less often looked at anyway, so high
> error rate there doesn't hurt too badly).

Ranking is especially important for unpopular search topics.

> > See http://pobox.com/~kragen/bigdb.html for more thoughts on this
> > issue, which is something I've been concerned about and trying to find
> > solutions to since before Google started operation.
> 
> I've been thinking about global P2P since 1996, or so.

Sorry --- I wasn't trying to get into a dick size war.  No doubt there
are people on FoRK who have been thinking about it since 1980.

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
Death seems to both sober and brighten people.  --- Marissa Mika