[FoRK] PersonalWeb patents "unique" hashes of content
Stephen D. Williams
sdw at lig.net
Wed Sep 19 16:42:49 PDT 2012
On 9/19/12 4:13 PM, J. Andrew Rogers wrote:
> On Sep 19, 2012, at 2:20 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
>> On 9/19/12 12:35 AM, J. Andrew Rogers wrote:
>> This is the basic database solution of key->data, with the possibility of matching keys for different data being one of the related problems. A content addressable store uses, in effect, the whole data value as the key, modulo the very unlikely possibility of collision in the case of any kind of decimation of the data.
> A content-addressable store uses some subset of the content to generate a key that can be used to address its storage location. The only reason hashing is ever done at all, particularly given the disadvantages, is to flatten key distributions.
If you use a subset of content to generate a key, it is a just a database. If you use all of the content to generate a highly
unique key, it is a content-addressable storage database. This is part of the definition of "content-addressable storage". 
Note that the semantics attached to that term are different from the term that would be the equivalent for RAM, where
Content-addressible Memory  indicates a search-like "contains" relationship.
> Skew is the problem of key distribution not matching storage distribution. It does not matter what these distributions look like as long as they match up. Using a hash function to flatten data across a simple array of storage buckets is just the simple/lazy way to achieve that but far from the only way.
That is a different, additional reason to use a hash.
>> A "globally unique key" is something that is A) smaller than the original data, B) something that any number of distributed, uncoordinated programs can compute with no shared state, and C) very unlikely to exhibit collisions of the same key on different data. The only thing that meets those constraints in general are crypto hashes.
> This is an assertion too far. Plenty of very large-scale content-addressable systems do not require or use cryptographic hashes for content-addressability. Even in cases where some or all of the content is hashed to improve value distribution, the content-addressing still is not hash-based. The last time I worked on a parallel or distributed system that was hash-based was probably circa 2005.
Those are just databases, not "content-addressable storage" databases as commonly understood. 
>> The Z-order curve  is a good way to linearize and cluster multi-variate keys. I've investigated using Hilbert Curves  to do something similar.
> Z-curve, as used here, has nothing to do with linearizing keys. It would not make much sense since we are talking about content-addressability.
> I suppose it might be possible to build a Hilbert structure for the same purpose but I expect it would be much slower.
The point of a Hilbert curve based index is to tend to put points that are close together in N-dimensional space, close together in
1 dimensional Hilbert space.
> Content Addressable Storage (CAS) and Fixed Content Storage (FCS) are different acronyms for the same type of technology. The CAS
> / FCS technology is intended to store data that does not change (fixed) in time. The difference is that typically CAS exposes a
> digest generated by acryptographic hash function <http://en.wikipedia.org/wiki/Cryptographic_hash_function>(such asMD5
> <http://en.wikipedia.org/wiki/MD5>orSHA-1 <http://en.wikipedia.org/wiki/SHA-1>) from the document it refers to. If the hash
> function is weak, this method could be subject to collisions in an adversarial environment (different documents returning the same
> When a new data element, or*blob*(Binary large object <http://en.wikipedia.org/wiki/Binary_large_object>), is added, the device
> calculates ahash <http://en.wikipedia.org/wiki/Hash_value>of the content and returns this hash as the blob's content address.^
> <http://en.wikipedia.org/wiki/Content-addressable_storage#cite_note-techworld.com-3> As mentioned above, the hash is searched for
> to verify that identical content is not already present. If the content already exists, the device does not need to perform any
> additional steps; the content address already points to the proper content. Otherwise, the data is passed off to a storage node
> and written to the physical media.
> When a content address is provided to the device, it first queries the directory for the physical location of the specified
> content address. The information is then retrieved from a storage node, and the actual hash of the data recomputed and verified.
> Once this is complete, the device can supply the requested data to the client. Within the Centera system, each content address
> actually represents a number of distinct data blobs, as well as optionalmetadata <http://en.wikipedia.org/wiki/Metadata>. Whenever
> a client adds an additional blob to an existing content block, the system recomputes the content address.
> Another typical implementation is from iTernity. The concept of iTernity bases of container, each container is addressed by its
> hash value. A container is a multiple number of fixed content documents, so one container is not changeable and the hash value is
> fixed after the write process.
> Unlike standard computer memory (random access memory <http://en.wikipedia.org/wiki/Random_access_memory>or RAM) in which the user
> supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed such that the user supplies
> a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found,
> the CAM returns a list of one or more storage addresses where the word was found (and in some architectures, it also returns
> thedata word <http://en.wikipedia.org/wiki/Data_word>, or other associated pieces of data).
More information about the FoRK