[FoRK] An interesting offshoot from the iPad discussion
J. Andrew Rogers
andrew at ceruleansystems.com
Tue Feb 2 23:33:25 PST 2010
On Feb 2, 2010, at 10:59 PM, Lucas Gonze wrote:
> What kind of analytics? You mean business intelligence/ marketing dashboards?
For most non-trivial query workloads, the details do not matter much. Anything that uses transitive closures or data types that are not topologically degenerate deteriorates badly at terabyte scales. Graph and spatial analytics are well-known worst cases. Most of the remainder become embarrassingly pathological by the time you reach a petabyte. Unfortunately, this describes a lot of high-value data sets. Once you get to the petabyte range, you are looking at the handful of things MapReduce can do, which isn't much for most practical purposes.
An important distinction is that even with the caveats specified above, we are talking about batch-mode, forensic analytics. If you want something that can scale to quasi-realtime query results then you will need to slash off a couple orders of magnitude. In some of the more pathological real-world cases that are not CPU bound per se, the data set limits are on the order of a gigabyte -- too slow in-memory on a typical processor even though computationally simple. Note that these are the limits of the algorithms used; they do not imply the limits of computer science.
> How come MapReduce is resistant to analytics?
MapReduce makes some *severely* restrictive assumptions about the data model; anything slightly complex quickly becomes intractable. It largely comes down to the consequences of requiring that data models, no matter how complex, be sharded. Most non-trivial analytics (graph, spatial, inductive, semantic, predictive, etc.) violate that assumption. It worked great for mining simple text patterns, not so well for everything else.
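To make the sharding point concrete, here is a minimal sketch in Python of a transitive closure computed as repeated map/shuffle/reduce rounds over an edge list. It is a toy single-process simulation, not how any particular MapReduce system implements it. The function names, the modulo shard rule, and the "cross-shard" counter are all assumptions made up for illustration; the counter is only a crude proxy for records that would have to move between machines.

```python
from collections import defaultdict


def shard_of(node, num_shards):
    # Hypothetical placement rule: nodes are assigned to shards by modulo.
    return node % num_shards


def closure_round(pairs, edges, num_shards):
    """One simulated round: extend each known path (a, b) by one edge (b, c)."""
    by_key = defaultdict(lambda: ([], []))
    cross_shard = 0
    # Map + shuffle: key every path by its endpoint so it meets edges leaving it.
    for a, b in pairs:
        by_key[b][0].append(a)
        if shard_of(a, num_shards) != shard_of(b, num_shards):
            cross_shard += 1  # crude proxy: this record would leave its home shard
    for b, c in edges:
        by_key[b][1].append(c)
    # Reduce: join path sources with edge targets that share the same key.
    new_pairs = set(pairs)
    for sources, targets in by_key.values():
        for a in sources:
            for c in targets:
                new_pairs.add((a, c))
    return new_pairs, cross_shard


def transitive_closure(edges, num_shards=4):
    """Iterate rounds until no new reachable pair appears."""
    pairs = set(edges)
    rounds = 0
    shuffled = 0
    while True:
        new_pairs, cross = closure_round(pairs, edges, num_shards)
        rounds += 1
        shuffled += cross
        if new_pairs == pairs:
            return pairs, rounds, shuffled
        pairs = new_pairs


if __name__ == "__main__":
    chain = [(i, i + 1) for i in range(7)]  # 0 -> 1 -> ... -> 7
    closure, rounds, shuffled = transitive_closure(chain)
    print(len(closure), rounds, shuffled)
```

On the 8-node chain above, 7 input edges grow to 28 reachable pairs and the loop runs 7 rounds, each of which would be a full job with its own shuffle. The number of rounds in this naive formulation tracks the longest path in the graph rather than anything local to a shard, which is one way to see why such workloads sit badly with a sharded model.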