[FoRK] Science without explicit theory
J. Andrew Rogers
andrew at ceruleansystems.com
Sun Jul 6 18:27:07 PDT 2008
On Jul 6, 2008, at 10:37 AM, Jeffrey Winter wrote:
> While I can appreciate the notion that large data sets - when the
> right algorithms are applied - can offer interesting insights, really
> it's a matter of resolution. I suppose there is something akin to
> a theory lurking in the Bayesian network of relationships among
> the data points, but isn't the Bayesian analysis
> done up front - the meta-theory if you will - the more interesting
> aspect of this?
>
> Yes, there are immensely larger data sets to work with, but the
> insights gained from them are part of a continuum; nothing
> fundamentally
> different is happening at this magical petabyte level - or at least
> the article doesn't show that it is.
I don't care to go into this too much, but there is a practical magic
of sorts that occurs as the data sets we can compute over grow
exponentially. What Google does is much too primitive to apply here,
scale notwithstanding, but the effect would probably look "emergent"
to your average computer scientist.
Having really, really large volumes of data allows you to inductively
recover algorithm bits that would be unrecoverable at smaller volumes
(buried beneath the information-theoretic error floor), a bit like
cryptanalysis where you can start to recover a few key bits here and
there. With a sufficiently large dataset, even a boneheaded brute-
force search (like Google's) will recover enough bits of a simple
pattern that anyone can find it given enough compute, which would
look like an emergent result if you had previously been working with
a similar dataset a few orders of magnitude smaller. But one does not
have to be quite so crude, even though no one ever thinks about it in
terms of mathematically recovering individual bits of a larger
algorithm.
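The noise-floor effect described above can be illustrated with a toy simulation (this is my sketch, not anything Google or the author actually runs): a hidden bit pattern is observed only through very noisy samples, and per-bit majority voting recovers it once you have enough data. The function name `recover_bits` and all parameters are illustrative.

```python
import random

def recover_bits(hidden, n_samples, flip_p):
    """Majority-vote recovery of a hidden bit pattern from noisy samples.

    Each sample is the hidden pattern passed through a binary symmetric
    channel that flips every bit independently with probability flip_p.
    With flip_p just under 0.5 the signal per sample is tiny, but the
    vote margin grows with n_samples and eventually clears the noise.
    """
    votes = [0] * len(hidden)
    for _ in range(n_samples):
        for i, b in enumerate(hidden):
            observed = b ^ (random.random() < flip_p)  # noisy observation
            votes[i] += 1 if observed else -1
    return [1 if v > 0 else 0 for v in votes]

random.seed(0)
hidden = [random.randint(0, 1) for _ in range(64)]
for n in (10, 100, 10000):
    guess = recover_bits(hidden, n, flip_p=0.45)
    matches = sum(g == h for g, h in zip(guess, hidden))
    print(f"{n:>6} samples: {matches}/64 bits recovered")
```

At 10 samples the recovered pattern is barely better than chance; by 10,000 samples the 0.45 flip probability is no obstacle at all, which is the "practical magic" of exponentially bigger datasets in miniature.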
No company anyone has heard of does this, but in theory we can start
to detect recoverable bits as soon as they rise above the
information-theoretic floor, and once you have collected enough of
them you can brute-force the remaining bits of the algorithm being
recovered from the data. This point arrives many orders of magnitude
before someone like Google would trip across it by accident. As the
data sets get exponentially bigger, the complexity and variety of
algorithms we can reverse engineer this way increase a few bits at a
time (which does not sound like much, but it is a lot), and we are
getting into the space where no human brain will be able to
accomplish the same feat with our lovely pattern-finding wetware, so
interesting new things will be discovered. That aspect makes it
interesting with respect to science: computers can develop models
where the inductive process from which the model was derived exceeds
human capability, even if the result is simple and elegant.
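The detect-then-brute-force procedure above can be sketched the same way (again my own toy construction, assuming the noisy-channel setup from before): pin down the bits whose vote margins clear a confidence threshold, then enumerate only the unresolved positions against some validity test. The names and the equality oracle are stand-ins; a real problem would supply its own check (a checksum, a predictive test, etc.).

```python
import itertools
import random

def noisy_votes(hidden, n_samples, flip_p):
    """Accumulate per-bit vote margins from noisy observations of hidden."""
    votes = [0] * len(hidden)
    for _ in range(n_samples):
        for i, b in enumerate(hidden):
            observed = b ^ (random.random() < flip_p)
            votes[i] += 1 if observed else -1
    return votes

def recover_then_brute_force(hidden, n_samples, flip_p, margin, oracle):
    """Fix high-confidence bits by voting, then brute-force the rest.

    Bits whose vote margin reaches `margin` are taken as recovered; the
    leftover positions are enumerated (2**k candidates instead of
    2**len(hidden)) and each candidate is checked against the oracle.
    """
    votes = noisy_votes(hidden, n_samples, flip_p)
    known = {i: (1 if v > 0 else 0)
             for i, v in enumerate(votes) if abs(v) >= margin}
    unknown = [i for i in range(len(hidden)) if i not in known]
    for bits in itertools.product((0, 1), repeat=len(unknown)):
        candidate = [0] * len(hidden)
        for i, b in known.items():
            candidate[i] = b
        for i, b in zip(unknown, bits):
            candidate[i] = b
        if oracle(candidate):
            return candidate, len(unknown)
    return None, len(unknown)

random.seed(1)
hidden = [random.randint(0, 1) for _ in range(32)]
# In this simulation the oracle just compares against the answer;
# in reality it would be whatever validity test the problem provides.
oracle = lambda c: c == hidden
recovered, n_brute = recover_then_brute_force(hidden, 500, 0.4, 50, oracle)
print(f"brute-forced only {n_brute} of {len(hidden)} bits")
```

The point of the sketch is the complexity collapse: every bit the voting stage recovers halves the size of the remaining search space, which is why the crossover arrives long before a blind exhaustive search would find anything.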
The reason you will not see this coming out of Google any time soon
is that their system is essentially a parallel grep over points in
1-space. I do not think many people would argue that this is the
right tool for the job. Multi-petabyte datasets that do not fit that
paradigm are ubiquitous, and exabyte-scale datasets would exist in
large numbers if they were economical to manage and possible to
analyze meaningfully with that kind of technology.
Cheers,
J. Andrew Rogers