[FoRK] PersonalWeb patents "unique" hashes of content
J. Andrew Rogers
andrew at jarbox.org
Wed Sep 19 16:13:25 PDT 2012
On Sep 19, 2012, at 2:20 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
> On 9/19/12 12:35 AM, J. Andrew Rogers wrote:
>> The alternative is algorithmically pulling bits out of the data and using those as addressing keys. In principle it is completely reversible, basically just an optimized bit-ordering without any hashing. However, in most real implementations you only grab the subset of bits required at the moment and ignore the rest since they have no practical value. For a basic example, see simple Z-curve addressing in HPC or spatial algorithms. More sophisticated and adaptive variants exist.
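For a concrete sketch of what I mean by "an optimized bit-ordering without any hashing" (a toy illustration, not code from any real system):

```python
def interleave(x, y, bits=16):
    """Morton / Z-order code: interleave the bits of x and y.

    The key is a pure reordering of the input bits -- no hashing --
    so it is completely reversible.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key

def deinterleave(key, bits=16):
    """Recover (x, y) from a Z-curve key, demonstrating reversibility."""
    x = y = 0
    for i in range(bits):
        x |= ((key >> (2 * i)) & 1) << i
        y |= ((key >> (2 * i + 1)) & 1) << i
    return x, y
```

In practice you would only take the high-order bits of the key you actually need for addressing, e.g. `key >> (2 * bits - 8)` for a coarse 8-bit address, and ignore the rest.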
> If you just pull some of the information out as a key to the whole blob of data, then you could have two blobs of data that could be different, but have the same keys.
I would expect collisions if you only use a subset of bits. The underlying algorithms can usually be used as a perfect hash function, though it would be pointless. Collisions do not have the negative connotation here that they do for hash functions.
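To make the point concrete (a made-up toy, not any particular system): if the addressing key is just the top few bits of a value, distinct values sharing a prefix land on the same key, and that is expected behavior rather than a failure.

```python
def subset_key(value, keep_bits=4, total_bits=16):
    """Use only the top keep_bits of a total_bits-wide value as the
    addressing key. Collisions are expected and harmless here."""
    return value >> (total_bits - keep_bits)

a, b = 0x1234, 0x1FFF
assert a != b                          # different data
assert subset_key(a) == subset_key(b)  # same addressing key
```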
> This is the basic database solution of key->data, with the possibility of matching keys for different data being one of the related problems. A content addressable store uses, in effect, the whole data value as the key, modulo the very unlikely possibility of collision in the case of any kind of decimation of the data.
A content-addressable store uses some subset of the content to generate a key that can be used to address its storage location. The only reason hashing is ever done at all, particularly given the disadvantages, is to flatten key distributions.
Skew is the problem of key distribution not matching storage distribution. It does not matter what these distributions look like as long as they match up. Using a hash function to flatten data across a simple array of storage buckets is just the simple/lazy way to achieve that but far from the only way.
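A sketch of the alternative to hash-flattening (toy code, names made up): pick bucket boundaries from the observed key distribution so that even a heavily skewed key space spreads evenly across storage. The distributions match up, no hashing required.

```python
import bisect

def make_boundaries(sample_keys, n_buckets):
    """Choose bucket boundaries at quantiles of the sampled keys,
    so the key distribution matches the storage distribution."""
    s = sorted(sample_keys)
    return [s[len(s) * i // n_buckets] for i in range(1, n_buckets)]

def bucket_for(key, boundaries):
    return bisect.bisect_right(boundaries, key)

keys = [x * x for x in range(1000)]    # heavily skewed key space
bounds = make_boundaries(keys, 4)
counts = [0] * 4
for k in keys:
    counts[bucket_for(k, bounds)] += 1
# counts comes out perfectly flat despite the skew
```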
> A "globally unique key" is something that is A) smaller than the original data, B) something that any number of distributed, uncoordinated programs can compute with no shared state, and C) very unlikely to exhibit collisions of the same key on different data. The only thing that meets those constraints in general are crypto hashes.
This is an assertion too far. Plenty of very large-scale content-addressable systems do not require or use cryptographic hashes for content-addressability. Even in cases where some or all of the content is hashed to improve value distribution, the content-addressing still is not hash-based. The last time I worked on a parallel or distributed system that was hash-based was probably circa 2005.
> The Z-order curve is a good way to linearize and cluster multi-variate keys. I've investigated using Hilbert curves to do something similar.
Z-curve, as used here, has nothing to do with linearizing keys. It would not make much sense since we are talking about content-addressability.
I suppose it might be possible to build a Hilbert structure for the same purpose but I expect it would be much slower.