[FoRK] Big Data

J. Andrew Rogers andrew at jarbox.org
Sat Feb 4 10:51:19 PST 2012

On Feb 4, 2012, at 7:40 AM, Gregory Alan Bolcer wrote:
> For human data, there's only so many ways you can process it.  Most of the times more data only means more cost.  The dark matter is the space in between, aka the correlations.  

Most people are using the wrong tools and the wrong data. Data analysis is being done on people-centric data, like social graphs, because the applications are people-centric. Using a social graph as the primary key of human behavior is defective because (1) there are widespread inconsistencies in the key set and (2) a vast number of entities that influence human behavior cannot be meaningfully represented in a human-centric data model.

This argument has started to gain currency. So what should replace the social graph? The primary key of reality: space and time. If you can track arbitrary entities and features in space-time, then you can infer most of the other relationships that we use in behavioral data models. It also provides a base model into which all data sources can be organized, so you do not have the impedance mismatch you see between data models for, say, satellite imagery and social graphs. The beauty of this model is that it can be used to analyze the behavior of arbitrary systems, not just human behavior.
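
To make "infer relationships from space-time" concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `Obs` record, the `colocated_pairs` name, the thresholds, and the brute-force pairwise scan, which would not survive contact with real data volumes. The point is only that an edge between two entities (human or not) can fall out of where and when they were observed, rather than being stored as a primary fact.

```python
from itertools import combinations
from math import hypot
from typing import NamedTuple


class Obs(NamedTuple):
    """One hypothetical observation of an entity at a point in space-time."""
    entity: str
    x: float
    y: float
    t: float


def colocated_pairs(observations, max_dist, max_dt):
    """Return pairs of distinct entities seen close together in space and time."""
    pairs = set()
    for a, b in combinations(observations, 2):
        if a.entity == b.entity:
            continue
        close_in_time = abs(a.t - b.t) <= max_dt
        close_in_space = hypot(a.x - b.x, a.y - b.y) <= max_dist
        if close_in_time and close_in_space:
            # Store each pair in sorted order so (a, b) and (b, a) collapse.
            pairs.add(tuple(sorted((a.entity, b.entity))))
    return pairs


observations = [
    Obs("alice", 0.0, 0.0, 100.0),
    Obs("bob", 1.0, 1.0, 105.0),
    Obs("truck_17", 0.5, 0.5, 102.0),   # not a person, still an entity
    Obs("carol", 50.0, 50.0, 101.0),    # same time, far away
]

edges = colocated_pairs(observations, max_dist=2.0, max_dt=10.0)
```

Note that `truck_17` ends up with edges just like the people do, which is the kind of entity a human-centric key set has no slot for.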

Once you have this type of analytical model, you need to be able to parallelize joins and transitive closures to make it useful.
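
For anyone who has not run into the term, a transitive closure can be written as a join repeated until nothing new comes out. The sketch below is a single-machine version using semi-naive iteration (only the newly found edges are joined each round). The function name and the sample edges are made up for illustration, and nothing here is parallel; distributing that repeated join is the hard part.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Compute all reachable (src, dst) pairs by repeatedly joining new edges."""
    # Index the base edges by source so the join step is a lookup.
    by_src = defaultdict(set)
    for src, dst in edges:
        by_src[src].add(dst)

    closure = set(edges)
    delta = set(edges)  # edges discovered in the previous round
    while delta:
        # Join: (a -> b) from delta with (b -> c) from the base edges.
        joined = {(a, c) for a, b in delta for c in by_src.get(b, ())}
        delta = joined - closure
        closure |= delta
    return closure


base_edges = {("a", "b"), ("b", "c"), ("c", "d")}
reachable = transitive_closure(base_edges)
```

Each round is one join against the base edge set, so the number of rounds grows with the longest path in the graph.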

And therein lies the problem. Most people doing "big data" are using very primitive distributed computing technologies like Hadoop, which does neither space-time data models nor graph analytics well.

> If anyone knows anything about highly correlated human data, it doesn't map well to divide and conquer approaches.  Techniques for mapping non-d-a-cq-bd are definitely ripe for some IP.

I think it would be more accurate to say that it does not map well to *naive* divide-and-conquer approaches. You won't get there using simple hash or range partitioning. There is already quite a bit of IP around more capable techniques, but I have not seen any of it in open source or in the literature.
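
A toy illustration of one way simple range partitioning goes wrong for correlated data, under made-up assumptions (one dimension, a single split point, hypothetical point names): records that need to be joined because they are near each other can land on opposite sides of a partition boundary, so the join cannot be answered inside one partition.

```python
from bisect import bisect_right
from itertools import combinations


def range_partition(x, boundaries):
    """Return the partition index for x given sorted split points."""
    return bisect_right(boundaries, x)


def cross_partition_neighbors(points, boundaries, max_dist):
    """List nearby pairs whose members fall in different partitions."""
    crossing = []
    for (name_a, xa), (name_b, xb) in combinations(points, 2):
        if abs(xa - xb) <= max_dist:
            if range_partition(xa, boundaries) != range_partition(xb, boundaries):
                crossing.append((name_a, name_b))
    return crossing


points = [("p1", 9.9), ("p2", 10.1), ("p3", 3.0), ("p4", 3.2)]
boundaries = [10.0]  # two partitions: x < 10.0 and x >= 10.0

split_pairs = cross_partition_neighbors(points, boundaries, max_dist=0.5)
```

Here `p3` and `p4` can be joined locally, while `p1` and `p2` are just as close but straddle the boundary. This only shows the problem; it says nothing about what the more capable techniques look like.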

J. Andrew Rogers
Twitter: @jandrewrogers
