[FoRK] PersonalWeb patents "unique" hashes of content

J. Andrew Rogers andrew at jarbox.org
Thu Sep 20 10:42:15 PDT 2012

On Sep 20, 2012, at 12:52 AM, "Stephen D. Williams" <sdw at lig.net> wrote:
> Take N dimensions, normalize them to have roughly even membership density ranges, then Z-order curve interleave them to the desired key length, up to full values.  Since a 2D Z curve is equivalent to a quadtree, an ND is an Ntree.  If you index the keys in order, you can find all of the keys in the same subspace by observing a common prefix to the desired level.  Each "level" is the power of two resolution of each of the keys at their scaled values.  So a key range gives you a rough set, or you could create a set of ranges for a precise definition of subspace boundaries.

Yep, that is an old school simple case.

> It is a generalization of the quadtree idea mapped into a single dimension key space where nearby keys are in nearby spaces with only predictable spacial / bitwise relationships.  Nice when you always want to search on all of the dimensions in the key, and easily mappable to a relational database.  What happens when you want to wildcard one or more of the dimensions?  Seems like you would have to scan through all of the possible ranges for that dimension.

You seem to be about 15 years behind the current computer science for spatial structures. A lot has changed. :-)

For content-addressing schemes, you have three basic families: hash, point, interval. Hash is a subset of point, being a degenerate 1-dimensional case, and point is a subset of interval since points are just degenerate intervals. You lose no capability or performance characteristics in implementation going from left to right but you gain a lot of expressivity and implementation complexity. Hash addressing stubs out 95% of an interval addressing implementation and is also much easier for the average code monkey to wrap their mind around. Despite their complex structure, interval keys are not any more expensive to compute than a crypto hash key if you do it correctly. A good interval implementation is as insensitive to data skew and collisions as  a distributed hash table that uses a crypto hash.

An interval implementation can not only address content by an equality operator (i.e. hash addressing), it can also do things like address content based on arbitrary interval intersections, which is how sensible distributed polygon indexes work. Savvy implementations also have other less obvious but powerful addressing operators that can be used to parallelize algorithms that are not spatial in any conventional sense.

So back to the question of wildcard dimensions, it is indistinguishable from addressing any other hyper-rectangle intersection and imposes no unusual cost. A "wildcard" is equivalent to an interval that spans the universe in that dimension.

More information about the FoRK mailing list