[FoRK] Joyent, cloud service evolution (J. Andrew Rogers)
k.holtman at chello.nl
Wed Jul 13 16:12:34 PDT 2016
On Tue, 21 Jun 2016, Marty Halvorson wrote:
> After reading all this, I wonder what you'll think of how the CERN LHC
> collects data?
> Each experiment generates 10s of trillions of bits per second. A reduction
> of up to 2 orders of magnitude is done in hardware, the result is sent via
> very high speed pipes to a huge number of PCs where a further reduction of 1
> or 2 orders of magnitude is done. Finally, the data is stored for research
> purposes, usually done on collections of supercomputers.
I usually just lurk on this list, but I worked on LHC data processing back
in the late 1990s/early 2000s, so I'll try to relate CERN LHC data
handling to the rest of this thread.
When the LHC is running, the combined signal pick-up elements around the
particle collision region in each LHC detector generate data on the order
of tens of TBytes/s. This data stream is filtered in real time: only
interesting data, for carefully worked-out definitions of interesting, is
selected for storage. Each detector has dedicated hardware built for the
first data filtering stages; a real-time compute farm then takes over to
filter further, down to a storage rate of about 1 GByte/s. So overall
this has a lot of similarities to the more extreme forms of IoT real-time
massive-scale sensor data stream processing.
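The staged reduction described above can be sketched as a chain of filters, each cheaper and coarser than the next. This is only an illustrative toy, not the actual trigger logic; the event fields, thresholds, and reduction factors are all invented for the example.

```python
import random

def hardware_trigger(event):
    # Stand-in for the first, dedicated-hardware filtering stage:
    # a fast, coarse cut that discards the vast majority of events.
    return event["energy"] > 0.99

def software_trigger(event):
    # Stand-in for the real-time compute farm: a slower, more
    # careful selection applied only to events that survived stage 1.
    return event["energy"] > 0.999

def pipeline(events):
    # Chain the two stages; only events passing both are "stored".
    stage1 = (e for e in events if hardware_trigger(e))
    return [e for e in stage1 if software_trigger(e)]

random.seed(42)
events = [{"id": i, "energy": random.random()} for i in range(100_000)]
stored = pipeline(events)
print(f"{len(events)} events in, {len(stored)} selected for storage")
```

The point of the structure is that each stage only ever sees what the previous one let through, so the expensive selection logic runs on a small fraction of the raw stream.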
Data analysis after storage consumes the bulk of the compute capacity used
by the LHC experiments, but it is 'embarrassingly parallel': a textbook
example of a workload mix that is easy to run at scale on COTS farms.
Supercomputers (massive machines that have unusual hardware to support
parallelism) are not required here. The COTS farms that are used do need
quite a lot of internal I/O bandwidth between bulk storage and compute
cores, though.
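"Embarrassingly parallel" here means each stored event can be analyzed with no communication between workers, so the work partitions trivially. A minimal sketch, using a thread pool on one machine to stand in for a farm of machines (the per-event `analyze` step is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(event):
    # Per-event analysis step (hypothetical): each event is
    # processed independently of every other event, which is what
    # makes the workload embarrassingly parallel.
    return event["energy"] ** 2

events = [{"id": i, "energy": i * 0.001} for i in range(10_000)]

# Events are partitioned across workers with no coordination
# between them; a COTS farm does the same across many machines.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(analyze, events))

print(f"analyzed {len(results)} events")
```

Because there is no cross-event state, scaling out is just a matter of adding workers; the binding constraint, as noted above, is the I/O bandwidth feeding events to them.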
When described in modern big data terms, the compute workload of data
analysis on stored LHC data is dominated by trial data cleaning runs done
on real and simulated data, trial runs needed to iteratively develop the
data cleaning algorithms. Every large LHC experiment has hundreds of
people working continuously on what is basically data cleaning algorithm
development. Work on these algorithms for the LHC started in the early
1990s, at the same time as design work on the LHC detectors.
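One iteration of this development loop can be sketched as: propose a cleaning cut, run it over simulated data where the ground truth is known, measure the result, and adjust. Everything below (the signal-to-noise cut, the purity target, the simulated events) is an invented illustration, not an actual LHC algorithm.

```python
def clean(events, threshold):
    # Candidate cleaning rule: drop events whose signal-to-noise
    # ratio falls below the current threshold (hypothetical cut).
    return [e for e in events if e["snr"] >= threshold]

def purity(events):
    # Fraction of surviving events that are genuinely signal;
    # computable only because the data is simulated, so ground
    # truth is known for every event.
    return sum(e["signal"] for e in events) / max(len(events), 1)

# Simulated events with known labels: signal tends to have a
# higher signal-to-noise ratio than background.
simulated = (
    [{"snr": 5.0 + 0.1 * i, "signal": True} for i in range(50)]
    + [{"snr": 0.1 * i, "signal": False} for i in range(50)]
)

# Trial runs: sweep the cut and keep the loosest threshold that
# reaches a target purity -- one iteration of algorithm development.
target = 0.95
threshold = next(t / 10 for t in range(100)
                 if purity(clean(simulated, t / 10)) >= target)
print(f"chosen threshold: {threshold}")
```

In practice each such trial run is itself a large batch job over real and simulated data, which is why this loop dominates the compute budget.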
Some recent numbers about the LHC are here:
In the last few years I have been working on big data and IoT. When I read
about the computing techniques people are discussing in this context, I am
often struck by LHC deja vu. Of course all the old techniques that
we determined we needed have newly minted names, or are referred to
indirectly by naming open source projects.
In my view (and this view has definitely been colored by personal
experience), if you cut away the hype and look at the real promise of
economic impact, then IoT and big data are just the latest tools in the
process improvement toolbox. Here I count 'consumer behavior
modification' also as a form of process improvement, a form that is
unfortunately too often being played as a long-term negative-sum game.
The biggest hurdle to process improvement has always been figuring out how
to overcome human and organizational inertia to change. The term IoT is
useful in change marketing, so a lot of process improvement innovation
that will happen in the next 10 years will be called IoT. A lot of this
innovation will not depend at all on having any new breakthroughs in data
processing speed or scalability. The state of current data processing
open source and best practices makes for an interesting discussion topic
on this list, however.