[FoRK] Joyent, cloud service evolution (J. Andrew Rogers)

Koen Holtman k.holtman at chello.nl
Wed Jul 13 16:12:34 PDT 2016



On Tue, 21 Jun 2016, Marty Halvorson wrote:

> After reading all this, I wonder what you'll think of how the CERN LHC 
> collects data?
>
> Each experiment generates tens of trillions of bits per second.  A reduction
> of up to 2 orders of magnitude is done in hardware; the result is sent via
> very high-speed pipes to a huge number of PCs where a further reduction of 1
> or 2 orders of magnitude is done.  Finally, the data is stored for research
> purposes, usually done on collections of supercomputers.

I usually just lurk on this list, but I worked on LHC data processing back 
in the late 1990s/early 2000s, so I'll try to relate CERN LHC data 
handling to the rest of this thread.

When the LHC is running, the combined signal pick-up elements around the
particle collision region in each LHC detector generate data on the order
of 10s of TBytes/s.  This data stream is filtered in real time: only
interesting data, for carefully worked out definitions of interesting, is 
selected for storage.  Each detector has dedicated hardware built for the 
first data filtering stages, then a real time compute farm takes over to 
filter further, down to a storage rate of about 1 Gbyte/s.  So overall 
this has a lot of similarities to the more extreme forms of IoT real-time 
massive-scale sensor data stream processing.
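
To make the shape of that pipeline concrete, here is a toy sketch in
Python.  It is nothing like the actual trigger software: the two filter
functions, the event fields, and all thresholds below are invented, and the
real first stage is dedicated hardware rather than a Python function.  It
only illustrates the structure of a cheap coarse cut followed by a slower,
more selective one.

# Toy sketch (not CERN code) of the two-stage real-time filtering idea
# described above: a cheap first-level trigger rejects most readouts, and a
# more expensive software filter decides what actually reaches storage.
# All event fields and thresholds are made up for illustration.

import random

def first_level_trigger(event):
    # Stand-in for the dedicated hardware stage: a fast, coarse cut that
    # throws away the vast majority of readouts.
    return event["total_energy"] > 150.0

def high_level_filter(event):
    # Stand-in for the real-time compute farm: a more selective decision
    # on the events that survived the first stage.
    return event["total_energy"] > 250.0 and event["n_tracks"] >= 2

def readout_stream(n):
    # Fake detector readout; in reality this is 10s of TBytes/s of signals.
    for _ in range(n):
        yield {"total_energy": random.expovariate(1 / 40.0),
               "n_tracks": random.randint(0, 10)}

stored = [e for e in readout_stream(1_000_000)
          if first_level_trigger(e) and high_level_filter(e)]
print(f"kept {len(stored)} of 1000000 events for storage")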

Data analysis after storage consumes the bulk of the compute capacity used
by the LHC experiments, but it is 'embarrassingly parallel': it is a
textbook example of a workload mix that is easy to run at scale on COTS
farms.  Supercomputers (massive machines that have unusual hardware to
support parallelism) are not required here.  The COTS farms that are used
do need quite a lot of internal I/O bandwidth between bulk storage and
compute cores, though.
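
The 'embarrassingly parallel' part is easy to picture: each stored file of
events can be analysed independently and the partial results merged at the
end.  A toy sketch along those lines (the file names and the per-file
analysis are placeholders, not any experiment's actual software):

# Toy sketch of fanning independent event files out over the cores of a
# COTS farm node and merging the partial results at the end.

from multiprocessing import Pool

def analyse_file(path):
    # Placeholder for reading one file of stored events and producing a
    # partial result (e.g. a count of selected events or a histogram).
    selected = 0
    # ... open `path`, loop over its events, apply the selection ...
    return selected

if __name__ == "__main__":
    # Hypothetical list of stored event files.
    files = [f"events_{i:04d}.dat" for i in range(1000)]
    with Pool() as pool:  # one worker process per local core
        partials = pool.map(analyse_file, files)
    print("total selected events:", sum(partials))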

When described in modern big data terms, the compute workload of data
analysis on stored LHC data is dominated by trial data cleaning runs on
real and simulated data: the trial runs needed to iteratively develop the
data cleaning algorithms.  Every large LHC experiment has hundreds of
people working continuously on what is basically data cleaning algorithm
development.  Work on these algorithms for the LHC started in the early
1990s, at the same time as design work on the LHC detectors.
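
What a single trial run looks like, very roughly: apply a candidate set of
cleaning cuts to both simulated and real events, compare how much of each
sample survives, then adjust the cuts and repeat.  The sketch below invents
all field names, cut values, and samples purely for illustration; the real
trial runs are large batch jobs over the stored data.

# Toy sketch of the iterative data cleaning development loop: scan a small
# grid of candidate cuts and report how the simulated and real samples
# survive each combination.  Everything here is invented for illustration.

import random

def passes_cleaning(event, max_noise, min_quality):
    # One candidate cleaning rule: reject noisy, badly reconstructed events.
    return event["noise"] < max_noise and event["fit_quality"] > min_quality

def survival_fraction(events, max_noise, min_quality):
    kept = sum(passes_cleaning(e, max_noise, min_quality) for e in events)
    return kept / len(events)

def trial_runs(simulated, real):
    # A large mismatch between the two samples sends the developers back to
    # refine either the cuts or the simulation.
    for max_noise in (5.0, 3.0, 1.0):
        for min_quality in (0.5, 0.7, 0.9):
            print(max_noise, min_quality,
                  survival_fraction(simulated, max_noise, min_quality),
                  survival_fraction(real, max_noise, min_quality))

# Tiny fake samples just to make the sketch runnable end to end.
simulated = [{"noise": random.uniform(0, 8), "fit_quality": random.random()}
             for _ in range(10_000)]
real = [{"noise": random.uniform(0, 10), "fit_quality": random.random()}
        for _ in range(10_000)]
trial_runs(simulated, real)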

Some recent numbers about the LHC are here:

https://home.cern/about/computing/processing-what-record

In the last few years I have been working on big data and IoT.  When I read
about the computing techniques people are discussing in this context, I am
often struck by an LHC deja vu.  Of course, all the old techniques that we
determined we needed now have newly minted names, or are referred to
indirectly by naming open source projects.

In my view (and this view has definitely been colored by personal
experience), if you cut away the hype and look at the real promise of
economic impact, then IoT and big data are just the latest tools in the
process improvement toolbox.  Here I also count 'consumer behavior
modification' as a form of process improvement, a form that is
unfortunately too often played as a long-term negative-sum game.

The biggest hurdle to process improvement has always been figuring out how
to overcome human and organizational inertia to change.  The term IoT is
useful in change marketing, so a lot of the process improvement innovation
that will happen in the next 10 years will be called IoT.  A lot of this
innovation will not depend at all on having any new breakthroughs in data
processing speed or scalability.  The state of current data processing
open source software and best practices does make for an interesting
discussion topic on this list, however.

Koen.

