[FoRK] Quarters needed for Apple to put Microsoft out of our misery?

J. Andrew Rogers andrew at jarbox.org
Wed Feb 18 12:25:18 PST 2015


(reordered slightly)


> This seems like the hard way to do things. Why model everything when you
> can model only the thing you need to know about? It's not like modeling is
> a solved problem. You can't push a button and out pops a model.


In the physical world, environmental influences on the behavior of systems are powerful and complex. The factors that can materially influence a system's behavior are myriad and diverse, and any model that claims robust predictive properties must account for them. Sometimes, adequately modeling the thing you need to know about requires mind-boggling amounts of ancillary data.

While it would make for simpler models, there are no closed systems in the real world, and many interesting dynamics cannot even be usefully approximated as closed systems. The “closed system” assumption is why predictive models of everything from climate change to economics fail regularly.

If you want to tackle the hard problems, there is only one model that matters. A lot of things we think we know aren’t so when you actually measure them.


> It's a physical reconstruction?
> What's in it? Temperature, humidity, light, density?
> What's not in it? Money? Identity?


Any data that can be spatiotemporally registered (i.e. virtually everything, even packets) can be placed into the model, but in practice there are some basic data sources that are widely useful (e.g. weather, some Internet data) and then a bunch of specialty or proprietary sources that are specific to the application (e.g. remote sensing data for environmental monitoring or a company’s sales data). Sensor data in particular is popular because it traditionally has had very limited analyzability and there is a lot of it.
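
To make “spatiotemporally registered” concrete, here is a minimal Python sketch of the normalization step; the record shape and field names are my own illustration, not any particular system’s schema:

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class Observation:
        lat: float      # WGS84 degrees
        lon: float
        t: datetime     # UTC timestamp
        source: str     # e.g. "weather", "netflow", "sales"
        payload: dict   # whatever the source actually measured

    def register(raw: dict, source: str) -> Observation:
        # Normalize a heterogeneous record into a common space-time
        # frame so it can be joined against every other source.
        return Observation(
            lat=float(raw["lat"]),
            lon=float(raw["lon"]),
            t=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
            source=source,
            payload={k: v for k, v in raw.items()
                     if k not in ("lat", "lon", "ts")},
        )

Once everything is in that common frame, disparate sources become joinable on nothing more than where and when.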

At this scale, everything looks like a fluid dynamics model; the individual particles are not that important. Most kinds of entities are statistically uninteresting at the individual level when modeling complex behaviors and dynamics at these scales, which is helpful because this kind of thing is extremely bandwidth-bound. PII never enters from external sources in these applications, for many reasons: regulatory and contractual ones, but also because it is almost never helpful. Even when the model is about the aggregate behavior of people, it is never about a person.
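
Continuing the sketch above, the fluid dynamics point is easy to illustrate: reduce individual points to coarse space-time cell statistics, at which point any notion of an individual entity simply disappears from the data. The cell sizes here are arbitrary choices for illustration:

    from collections import defaultdict

    def cell_key(obs: Observation, deg: float = 0.1, secs: int = 3600):
        # Coarse space-time bin: roughly 10 km grid cells, hourly buckets.
        return (round(obs.lat / deg), round(obs.lon / deg),
                int(obs.t.timestamp()) // secs)

    def densify(observations):
        # Reduce individual points to per-cell counts; the identity
        # of any single "particle" is discarded in the process.
        cells = defaultdict(int)
        for obs in observations:
            cells[cell_key(obs)] += 1
        return cells    # a density field, not a list of entities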

Data quality at this volume and velocity cannot be ensured by inspection and curation, so you have to do statistical entity and source blending, in addition to data reduction, just to remove the errors and noise, regardless of the type of data. As a consequence, a data point in a nicely polished model often cannot be easily mapped back to a specific entity or measurement.
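
One standard way to do that kind of blending, chosen here purely for illustration, is inverse-variance weighting of redundant measurements of the same quantity:

    def blend(measurements):
        # Fuse redundant (value, variance) measurements of the same
        # quantity from multiple noisy sources; noisier sources get
        # proportionally less weight.
        weights = [1.0 / var for _, var in measurements]
        total = sum(weights)
        value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
        return value, 1.0 / total   # blended estimate and its variance

    # e.g. three sensors reading the same temperature:
    # blend([(21.2, 0.5), (20.8, 0.2), (21.6, 1.0)]) -> (21.0, 0.125)

Real blending also has to reconcile entities across sources, but the statistical flavor is the same.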


> What is a use case?


Three major use cases, all more boring than most people imagine:

1) Maximizing operational efficiency. There is a vast gap between how organizations think their world works and how it actually works, but they have never been able to measure it with sufficient fidelity or drive operational decisions from those measurements. This covers everything from logistics optimization to real-time location analytics to smarter cities to customer satisfaction.

2) Ground truth and risk detection. Many aspects of the world are rarely measured or looked at because it is expensive, inconvenient, or impractical for people to do so. There are fast-moving dynamics in nature that are never captured directly and are often noticed only after they create detectable anomalies in the human economy. Not only can sensor blending give a real-time operating picture of infrastructure and the broader planet, but increasingly it will be possible to task robotic sensors to investigate more thoroughly when interesting or unusual patterns are detected. Any industry closely tied to natural resources (e.g. agriculture) or heavy infrastructure cares about this a lot.

3) Managing sensor data, e.g. the whole Internet of Things craze. Billions of dollars were invested before anyone realized that Hadoop, Spark, MemSQL, etc. are worthless for dealing with sensor data. This kind of platform is a natural fit because, as a database, it organizes sensor data the same way the physical world is organized. Some even have a SQL interface; a hypothetical query is sketched below.
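
For a flavor of what “organized the way the physical world is organized” buys you, here is what such a query might look like. Every identifier below (sensor_readings, grid_cell, st_within, bbox) is invented for illustration and does not correspond to any real product’s dialect:

    # A hypothetical spatiotemporal SQL query, wrapped in Python
    # only to keep these sketches in one language.
    QUERY = """
    SELECT grid_cell(lat, lon, 0.01) AS cell,
           avg(reading)              AS mean_vibration,
           count(*)                  AS n
    FROM   sensor_readings
    WHERE  st_within(lat, lon, bbox(-122.5, 37.6, -122.3, 37.9))
      AND  ts > now() - interval '1 hour'
    GROUP  BY cell
    """

The point is that space and time are first-class indexing dimensions, not columns bolted onto a system designed for something else.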


> Is there more than one of it? Who is building It or Them?


A bunch of small-ish private ones currently, scattered across industrial sectors, but the economics are rapidly pushing convergence toward a large, shared base model. Much sooner than later it will be a pure shared cloud, and geographically federated at that. It is the only practical way forward.

Building these things required solving a pretty tall and diverse stack of previously unsolved computer science problems. Since I developed and designed most of the original technology (before the turn of the decade!), I am the de facto expert on the technical implementation of these types of systems. 
 

> That this could exist stretches credibility. A single integrated top-down
> model of everything everywhere is implausible.


It is still pretty nascent and far from the ultimate potential but people who have actually seen these things running usually use terms like “jaw-dropping” to describe their first contact. (AFAIK, and inconveniently, no one has come up with a name for this category of software system. It was not already in the tech lexicon.) 

You do not need “everything everywhere” to radically reshape an industry. There is a lot of low-hanging fruit; the capability simply did not exist until recently. The blind man who suddenly gains some vision is not going to care much whether it is 20/20.

The technical limitation is bandwidth, which has a pretty high ceiling. It is trivial, and not particularly expensive, to drive a petabyte per day through disk storage in a single data center rack; that is more data than Facebook collects globally. Machine-generated data sources are much bigger than Facebook's, but not unmanageably so.
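
The arithmetic behind that claim, for the skeptical (the drive count is an assumption about a reasonably dense storage rack):

    PB = 10**15                  # one petabyte, in bytes
    DAY = 86_400                 # seconds per day
    rate = PB / DAY              # sustained ingest rate
    print(f"{rate / 1e9:.1f} GB/s")       # -> 11.6 GB/s for the rack
    print(f"{rate / 80 / 1e6:.0f} MB/s")  # -> 145 MB/s per drive,
    # assuming 80 drives in the rack -- within the sequential
    # throughput of commodity disks.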



> All that said, we are already living in a dystopia wrought by the
> intelligence community, so I can imagine those folks doing what you're
> describing.


Not really; stuff like this is always out in the private sector first. The intelligence community, while occasionally showing some agility, is still a giant political bureaucracy. Use cases would be different, obviously.

Basically, it is bringing analytics applied to the virtual world into the physical world, with analogous implications. 





