[FoRK] Science without explicit theory

J. Andrew Rogers andrew at ceruleansystems.com
Sun Jul 6 18:27:07 PDT 2008

On Jul 6, 2008, at 10:37 AM, Jeffrey Winter wrote:
> While I can appreciate the notion that large data sets - when the
> right algorithms are applied - can offer interesting insights, really
> it's a matter of resolution.  I suppose there is something akin to
> a theory lurking in the Bayesian network of relationships among
> the data points, but isn't the Bayesian analysis
> done up front - the meta-theory if you will - the more interesting
> aspect of this?
> Yes, there are immensely larger data sets to work with, but the
> insights gained from them are part of a continuum; nothing  
> fundamentally
> different is happening at this magical petabyte level - or at least
> the article doesn't show that it is.

I don't care to go into this too much, but there is a practical magic  
of sorts that occurs as the data sets we can compute over grow  
exponentially, though what Google does is much too primitive to apply  
here, scale notwithstanding, and the effect would probably look  
"emergent" to your average computer scientist.

Having really, really large volumes of data allows you to inductively  
recover algorithm bits that would be unrecoverable at smaller volumes  
(below the information-theoretic error floor), a bit like  
cryptanalysis, where you can start to recover a few key bits here and  
there.  With a sufficiently large dataset, even a boneheaded brute- 
force search (like Google's) will recover enough bits of a simple  
pattern that it becomes obvious to anyone with enough compute, which  
would look like an emergent result if you had previously been working  
with a similar dataset a few orders of magnitude smaller.  One does  
not have to be quite so crude, though, even if no one ever thinks  
about it in terms of mathematically recovering individual bits of a  
larger algorithm.
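To make the analogy concrete, here is a minimal toy model of my own (not anything Google runs, and the bias and sample sizes are purely illustrative): a hidden bit leaks into each data point with a slight statistical bias, and a majority vote only recovers it reliably once the sample count grows well past roughly 1/bias².

```python
import random

def observe(bit, bias, n, rng):
    # Each noisy observation equals the hidden bit with
    # probability 0.5 + bias -- a tiny statistical leak.
    return [bit if rng.random() < 0.5 + bias else 1 - bit
            for _ in range(n)]

def recover(samples):
    # Majority vote: the inductive "recovery" of one bit.
    return int(sum(samples) * 2 > len(samples))

def success_rate(bias, n, trials=200, seed=1):
    # Fraction of trials in which the hidden bit is recovered.
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        bit = rng.randrange(2)
        ok += recover(observe(bit, bias, n, rng)) == bit
    return ok / trials
```

With a 2% bias, `success_rate(0.02, 100)` hovers not far above chance, while `success_rate(0.02, 10000)` is essentially 1.0: the same signal that is invisible at small volume becomes unmissable at large volume.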

No company anyone has heard of does this, but in theory we can start  
to detect recoverable bits as soon as they rise above the information- 
theoretic floor, and once you have collected enough of them you can  
brute-force the remaining bits of the algorithm being recovered from  
the data.  That point arrives many orders of magnitude before someone  
like Google would trip over it by accident.  As the data sets get  
exponentially bigger, the complexity and variety of algorithms we can  
reverse engineer this way increase a few bits at a time (which does  
not sound like much, but it is a lot), and we are getting into a space  
where no human brain is going to accomplish the same feat with our  
lovely pattern-finding wetware, so interesting new things will be  
discovered.  That aspect makes it interesting with respect to science,  
because computers can develop models whose inductive derivation  
exceeds human capability even when the result is simple and elegant.
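A hedged sketch of the "brute-force the rest" step (the key size, function names, and fingerprint check are my own illustrative assumptions, not anything from the post): suppose induction over the data has pinned down the high 10 bits of a 16-bit parameter, and a candidate can be checked against an observed fingerprint; the remaining 6 bits then cost only 64 trials instead of 65,536.

```python
import hashlib

def fingerprint(key):
    # Stand-in for whatever consistency check the observed data
    # supports for a candidate key.
    return hashlib.sha256(key.to_bytes(4, "big")).hexdigest()

def brute_force_rest(known_high_bits, unknown_bits, target):
    # Only 2**unknown_bits candidates remain once the high bits
    # have been recovered inductively.
    base = known_high_bits << unknown_bits
    for low in range(1 << unknown_bits):
        candidate = base | low
        if fingerprint(candidate) == target:
            return candidate
    return None
```

With a 16-bit key `0xBEEF` and its high 10 bits already recovered, `brute_force_rest(0xBEEF >> 6, 6, fingerprint(0xBEEF))` finds the key in at most 64 checks; each inductively recovered bit halves the residual search space.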

The reason you will not see this coming out of Google any time soon is  
that their system is essentially a parallel grep over points in  
1-space.  I do not think many people would argue that this is the  
right tool for the job.  Multi-petabyte datasets that do not fit that  
paradigm are ubiquitous, and exabyte-scale datasets would exist in  
large numbers if they were economical to manage and possible to  
analyze meaningfully with that kind of technology.


J. Andrew Rogers
