[FoRK] PersonalWeb patents "unique" hashes of content

Kent Spaulding kent at iotabits.com
Wed Sep 19 14:40:35 PDT 2012


As a counterpoint, CDDB.  They did/do not hash the entire contents.

I think CDDB also predates this patent?
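
If I'm remembering the freedb/CDDB scheme right, the disc ID is derived
from the table of contents alone -- track count, track start offsets,
and total length -- never from the audio samples.  A rough Python sketch
from memory (names are mine; the exact field layout may be off):

    def digit_sum(n):
        # sum of the decimal digits of n
        s = 0
        while n > 0:
            s += n % 10
            n //= 10
        return s

    def cddb_disc_id(track_offsets, leadout):
        # offsets in CD frames (75 per second), as read from the TOC
        starts = [off // 75 for off in track_offsets]
        checksum = sum(digit_sum(t) for t in starts) % 255
        total_secs = leadout // 75 - starts[0]
        return (checksum << 24) | (total_secs << 8) | len(track_offsets)

Two pressings with the same TOC timing collide, and identical audio with
different track gaps gets different IDs -- a fingerprint of the layout,
not of the content.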

On Sep 19, 2012, at 2:20 PM, Stephen D. Williams wrote:

> On 9/19/12 12:35 AM, J. Andrew Rogers wrote:
>> On Sep 18, 2012, at 11:16 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
>>> I haven't seen anything referred to as "content addressable storage" that didn't mean "run some hash over the whole file and use that as the unique ID for that data".  You could add the length of the file to increase resilience to collisions.  And there are sampling methods that should serve as likely, though not certain, lightweight alternatives.  (I have to run an empirical test on that soon.)  You could also run a filter or transformation to find equivalent files with different bytes, but that's just the same thing at a different granularity.
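>>> 
>>> A minimal sketch of the whole-file variant with the length folded in (Python; the naming is mine):
>>> 
>>>     import hashlib, os
>>> 
>>>     def content_id(path):
>>>         h = hashlib.sha256()
>>>         with open(path, 'rb') as f:
>>>             # stream so large blobs don't have to fit in memory
>>>             for chunk in iter(lambda: f.read(1 << 20), b''):
>>>                 h.update(chunk)
>>>         # fold in the length: colliding files of different
>>>         # lengths still get distinct IDs
>>>         return '%s-%d' % (h.hexdigest(), os.path.getsize(path))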
>>> 
>>> What are you thinking of as an alternative to whole-file hash?
>> 
>> The alternative is algorithmically pulling bits out of the data and using those as addressing keys. In principle it is completely reversible, basically just an optimized bit-ordering without any hashing. However, in most real implementations you only grab the subset of bits required at the moment and ignore the rest since they have no practical value. For a basic example, see simple Z-curve addressing in HPC or spatial algorithms. More sophisticated and adaptive variants exist.
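>> 
>> A toy version for two 16-bit coordinates -- a plain Morton/Z-order interleave, fully reversible, no hashing (Python; the names are mine):
>> 
>>     def interleave(x, y):
>>         # weave the bits together: x takes the even bit
>>         # positions of the key, y the odd ones
>>         key = 0
>>         for i in range(16):
>>             key |= ((x >> i) & 1) << (2 * i)
>>             key |= ((y >> i) & 1) << (2 * i + 1)
>>         return key
>> 
>>     def deinterleave(key):
>>         # undo the weave -- no information was discarded
>>         x = y = 0
>>         for i in range(16):
>>             x |= ((key >> (2 * i)) & 1) << i
>>             y |= ((key >> (2 * i + 1)) & 1) << i
>>         return x, y
>> 
>> Sorting on such keys clusters nearby (x, y) pairs together, which is exactly the relative-value property a hash throws away.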
> 
> If you just pull some of the information out as a key to the whole blob of data, then you could have two blobs of data that are different but have the same key.  That is the basic database solution of key->data, with matching keys for different data as one of the attendant problems.  A content addressable store uses, in effect, the whole data value as the key, modulo the very unlikely possibility of collision whenever the data is decimated in any way.  A perfect content addressable store would use the whole value to retrieve the record, but there's usually little point to that.  The "innovation" of a content addressable store is that a sufficiently good hash has so little statistical likelihood of collision that it is effectively equivalent to looking up a blob with the blob itself.  While you could also pull key information from the blob, the only globally unique key for that data is a strong ("crypto") hash of that data.
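> 
> In code the whole trick is about this small; a minimal in-memory sketch (class and method names are mine):
> 
>     import hashlib
> 
>     class ContentStore:
>         def __init__(self):
>             self._blobs = {}
> 
>         def put(self, blob):
>             # the bytes themselves determine the key
>             key = hashlib.sha256(blob).hexdigest()
>             self._blobs[key] = blob
>             return key
> 
>         def get(self, key):
>             return self._blobs[key]
> 
> Putting the same bytes twice yields the same key, so deduplication falls out for free; two different blobs under one key would require a SHA-256 collision, which is the bet the whole scheme rests on.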
> 
> A "globally unique key" is something that is A) smaller than the original data, B) something that any number of distributed, uncoordinated programs can compute with no shared state, and C) very unlikely to exhibit collisions of the same key on different data. The only thing that meets those constraints in general are crypto hashes.  There's no reason that traditional key or clustering information couldn't also be extracted and prepended to the hash or used as a secondary key.
> 
> The Z-order curve [1] is a good way to linearize and cluster multi-variate keys.  I've investigated using Hilbert Curves [2][3] to do something similar.
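> 
> The Hilbert mapping is only slightly more code than the Z-order interleave; this is a Python transcription, from memory, of the xy2d routine given on the Wikipedia page [2]:
> 
>     def xy2d(n, x, y):
>         # map (x, y) on an n-by-n grid (n a power of two) to a
>         # distance along the Hilbert curve
>         d = 0
>         s = n // 2
>         while s > 0:
>             rx = 1 if (x & s) > 0 else 0
>             ry = 1 if (y & s) > 0 else 0
>             d += s * s * ((3 * rx) ^ ry)
>             # rotate/flip the quadrant so the sub-curve lines up
>             if ry == 0:
>                 if rx == 1:
>                     x = n - 1 - x
>                     y = n - 1 - y
>                 x, y = y, x
>             s //= 2
>         return d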
> 
>> 
>> A reason you might want to do this instead of using a hash function is that it preserves relative value relationships, beyond a pseudo-equality operator in the access method, at no extra computational cost. The downside is that it requires more thoughtful design, and the slick, general algorithms are not well documented. Hashing has the advantage that it is simple, easy to understand, and unencumbered, but that is about it.
>> 
> 
> [1] http://en.wikipedia.org/wiki/Z-order_curve
> [2] http://en.wikipedia.org/wiki/Hilbert_curve
> [3] http://en.wikipedia.org/wiki/Hilbert_R-tree
> 
> sdw
> 
> _______________________________________________
> FoRK mailing list
> http://xent.com/mailman/listinfo/fork