[FoRK] PersonalWeb patents "unique" hashes of content
Stephen D. Williams
sdw at lig.net
Wed Sep 19 22:38:44 PDT 2012
On 9/19/12 6:42 PM, J. Andrew Rogers wrote:
> On Sep 19, 2012, at 4:42 PM, Stephen D. Williams <sdw at lig.net> wrote:
>>> A content-addressable store uses some subset of the content to generate a key that can be used to address its storage location. The only reason hashing is ever done at all, particularly given the disadvantages, is to flatten key distributions.
>> If you use a subset of content to generate a key, it is a just a database. If you use all of the content to generate a highly unique key, it is a content-addressable storage database. This is part of the definition of "content-addressable storage".  Note that the semantics attached to that term are different from the term that would be the equivalent for RAM, where Content-addressible Memory  indicates a search-like "contains" relationship.
> I think you are conflating "definition" with "typical implementation".
> A 128-bit key has the same information with respect to the underlying content regardless of the algorithm used to generate it if it was constructed using at least 128-bits of content. The transform used to construct that 128-bit key is not material except to the extent the choice of transform impacts efficiency of the implementation. The only requirement of the transform is that it produces the same key every time it is given some content as input.
I don't see any definition or usage that is that relaxed. The defining difference is whether changing any bit of information in the
blob would change the key or not. If every bit counts, then it is a CAS, if not, it isn't, at least not in the original coinage or
common usage. Any key that is a subsample is not equivalent to having every bit affect the resulting key. The size of the key
isn't relevant, other than determining how likely collisions are. In both cases there is a database of some type that stores a blob
linked to a key. It's how the key is created that makes the difference. In most database cases, the key can be derived from the
data through some process, especially including 100% of relational databases.
From the aspect of relationship of key to stored data, there seem to be three cases:
File system, a loose type of database:
Key is not derived directly from data. Key is identity, but is not intrinsic. Not derivable at all.
Common concept of a database, esp. RDBMS but also most NoSQL, SPARQL, etc.:
Key is a subset of or derived directly from a subset of data. Key is only relative identity, repeatably derivable, but not globally
unique to data: i.e. similar but different data could generate the same key.
Content-addressable storage (CAS):
Key is derived from every bit/byte of the data stored in a way designed to make collisions rare and nearly impossible to
deliberately achieve. Key is globally and uniquely derivable from data. Any difference in data produces a different key, modulo
infinitesimal possibility of collision.
The test here are the kinds of applications that can be applied. The canonical CAS application could not work well on any other kind
of database method. The canonical application is storing any number of any kind of data exactly once for each unique blob, creating
a fixed-length short (hash length) key for each blob that is globally and repeatably unique. Document store, deduplicatating
block/backup store, etc. Store a bunch of files files independently in a distributed fashion so every instances uses the same key
for the same data allowing any kind of sharing, merging, or federation to work.
CDDB: That database method makes the assumption that no two CDs have the same number of tracks, in the same order, with the same
apparently random number of seconds of data. That is a very restricted case of the same general idea. I wouldn't call it a classic
CAS. It only seems like it because it doesn't seem likely that two CDs would have been published that have the same form but
different data. But that easily could be the case: Perhaps the first one was corrupted, or not quite the right amplitude, or some
noise was removed, or bad language was dubbed. All of those cases would fail to produce different keys for different data but the
> Strong hash functions are one type of transform capable of generating effective, collision-resistant keys from content with the required bits of information. They are not the only such transform.
>> The point of a Hilbert curve based index is to tend to put points that are close together in N-dimensional space, close together in 1 dimensional Hilbert space.
> Space-filling curves were largely never used for indexing for a reason. Hilbert-based indexes were among the best but still not great.
I understood that their most successful use was probabilistic clustering of record location for databases. Apparently someone
implemented a color space mapper in hardware using the Hilbert somehow.
> But since we are talking about content-addressable storage rather than indexing, Hilbert-like models are poor relative to some other curve-based schemes you could choose to use. The computation is only vaguely related to the kinds of computation you would use for an index; what makes Hilbert good for indexing has no bearing for content-addressing.
More information about the FoRK