Re: [Stupid Idea Series] WWW/2 Take 2

Kragen Sitaker (kragen@pobox.com)
Mon, 1 Mar 1999 09:27:42 -0500 (EST)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Tim Byars: "Re: TBTF for 1999-03-01: Light of other days"
Previous message: Robert Harley: "Re: TBTF for 1999-03-01: Light of other days"
Maybe in reply to: Gregory Alan Bolcer: "[Stupid Idea Series] WWW/2 Take 2"
Next in thread: Robert S. Thau: "Re: [Stupid Idea Series] WWW/2 Take 2"
Next in thread: Tom Arkwright: "Re: [Stupid Idea Series] WWW/2 Take 2"

Re: automatically determining what pages are actually versions of each
other: I put together a little piece of software that heuristically
does exactly that, a couple of weeks ago. It seemed to do a reasonable
job under some limited circumstances. I haven't tested it on many data
sets, though, and it's currently O(N^2) in the number of documents.

Briefly, it runs through each document, hashing each length-n substring
of the document (with an incremental algorithm as used by Rabin-Karp
string-searching) and keeping a histogram of which hash values pop up
how often. (So your hash table just has ints in it; when you find a
length-n string with hash value x, you increment hashtable[x].)

You compare documents by comparing their histograms; I simply treated
the histograms as vectors with many dimensions and took their dot
product, then divided the dot product by the product of the vectors'
magnitudes. With different sizes of hash tables and different values
for 'n', I was able to get results on some test data sets that seemed
to mostly-correctly trace the genealogy of some source code,
distinguish different kinds of files (troff from executables from
ordinary text), and even distinguish different authors.

But I haven't really done thorough testing. So I might have just
gotten unusually positive results.

Anyway, I think it should be possible to come up with better ways of
finding close matches to a document than simply testing it against all
other documents.

(Anyone think I could publish a paper in a CS journal about this guy?
I'm not really familiar with the academic publishing world, and I don't
know what it takes to get a paper published, but this seems like it
ought to be interesting enough, if there's anything to it.)

The program code is theoretically available from
kragen-hacks-index@kragen.dnaco.net, but kragen.dnaco.net is quite ill
right now. I hope I can get it back up in a few days.

Re: mobility: I don't quite understand what you're proposing. You want
a fixed IP address for every device, regardless of location? If so,
why do you need dynamic DNS registration?

About dynamic DNS and updating your DNS record every time you go
online: I wrote a quick Perl hack that does exactly that at
<URL:http://pobox.com/~kragen/sw/dyndns.html>. I ran it and tested it
on my own machine at home, but I never used it in the larger world.
I'd be interested if anyone else did. ;)

In its current primitive form, it could probably handle a couple of
updates per second.

-- 
<kragen@pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
Techno addiction. More expensive than crack, keeps you up longer than
coke, makes you fatter than pot, but hey... it's legal. 
	-- Tim Byars <tbyars@earthlink.net>

Next message: Tim Byars: "Re: TBTF for 1999-03-01: Light of other days"
Previous message: Robert Harley: "Re: TBTF for 1999-03-01: Light of other days"
Maybe in reply to: Gregory Alan Bolcer: "[Stupid Idea Series] WWW/2 Take 2"
Next in thread: Robert S. Thau: "Re: [Stupid Idea Series] WWW/2 Take 2"
Next in thread: Tom Arkwright: "Re: [Stupid Idea Series] WWW/2 Take 2"