[FoRK] PersonalWeb patents "unique" hashes of content
J. Andrew Rogers
andrew at jarbox.org
Wed Sep 19 00:35:08 PDT 2012
On Sep 18, 2012, at 11:16 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
> I haven't seen anything referred to as "content addressable storage" that didn't mean "run some hash over the whole file and use that as the unique ID for that data". You could add length of the file to increase resilience to collisions. And there are sampling methods that should be good as likely, but not certain lightweight alternatives. (Have to run an empirical on that soon.) You could also run a filter or transformation to find equivalent files with different bytes, but that's just the same thing at a different granularity.
> What are you thinking of as an alternative to whole-file hash?
The alternative is algorithmically pulling bits out of the data and using those as addressing keys. In principle it is completely reversible, basically just an optimized bit-ordering without any hashing. However, in most real implementations you only grab the subset of bits required at the moment and ignore the rest since they have no practical value. For a basic example, see simple Z-curve addressing in HPC or spatial algorithms. More sophisticated and adaptive variants exist.
A reason you might want to do this instead of using a hash function is that it preserves relative value relationships beyond a pseudo-equality operator in the access method at no extra computational cost. The downside is that it requires more thoughtful design and the slick, general algorithms are not well documented. Hashing has the advantage that it is simple, easy to understand, and unencumbered but that is about it.
More information about the FoRK