[FoRK] Multicore, async segmented sequential models, Re: outsourcing back end

Stephen D. Williams sdw at lig.net
Wed May 8 02:15:53 PDT 2013

On 5/7/13 5:23 PM, J. Andrew Rogers wrote:
> On May 7, 2013, at 4:43 PM, "Joseph S. Barrera III" <joe at barrera.org> wrote:
>> On 5/7/2013 4:35 PM, Owen Byrne wrote:
>>> The actual architecture (single threaded with asynchronous callbacks)
>>> makes it quite a bit easier to wrap your head around than some
>>> other multi-processing architectures like say, Java multithreading.
>> But isn't single threading kind of out of step with modern architectural trends, with pretty much every processor being multicore, especially those on servers?

At both NCR (mid-1980s) and AOL (1995-1998), the programming models were event loops processing messages streamed through TCP/IP.  
This provided all kinds of performance advantages: fewer than one IO or network packet per typical transaction, very 
efficient processing because there were far fewer than one system call per transaction, low memory usage (the single Buddylist server 
was an HP-UX server with 2GB of RAM and a 1GB per-process limit), and a clean programming model.  All services at AOL were implemented 
via home-grown lightweight async message passing and routing, with each process having a message router, memory queuing, retry 
logic, and a TCL-based shell-like command server for control, status, debugging, and configuration.

I was puzzled for many years about why the rest of the world was so ignorant of this model.  I believe it was well known in some 
parts of the mainframe world, probably the guts of CICS or similar, but I've never seen anything published.  I tried to educate 
people on projects and publicly in various forums about its advantages, but since most programmers have a herd mentality, and since 
all the APIs, examples, magazines, books, etc. were talking about thread-per-connection, synchronous RPC, and the like, it was a lost 
cause.  We needed a lot more of the world to feel the pain, recognize the problem, and become receptive to better models.  Basically, it 
has taken 20+ years for general software development to mature.  You could blame this on Microsoft et al. if you like.

Node and Nginx get it right in a lot of ways.  Ideally, the kernel would package multiple requests and IO events on multiple 
connections into a message stream that could be processed in far fewer IOs, but otherwise, the model is not bad.  (AOL essentially 
had communications concentrators at the edges that turned all communication into packets in the message flow.  Packets were 
transported as streams of bytes on TCP/IP connections.  So if messages were 60 or 200 bytes (easily possible given the app styles), 
many could fit in a single 30K TCP/IP buffer read.)
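The batching idea is easy to sketch. Assuming a simple length-prefixed framing (the names `Message` and `parse_messages` are illustrative, not AOL's actual protocol), one large read from the TCP stream can be split into dozens of application messages with no further system calls:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Message { std::string payload; };

// Consume as many complete [u16 big-endian length][payload] frames as the
// buffer holds; returns the parsed messages and erases the consumed bytes,
// leaving any trailing partial frame for the next read to complete.
std::vector<Message> parse_messages(std::vector<uint8_t>& buf) {
    std::vector<Message> out;
    size_t pos = 0;
    while (buf.size() - pos >= 2) {
        uint16_t len = static_cast<uint16_t>(buf[pos]) << 8 | buf[pos + 1];
        if (buf.size() - pos - 2 < len) break;   // partial frame: wait for more bytes
        out.push_back({std::string(buf.begin() + pos + 2,
                                   buf.begin() + pos + 2 + len)});
        pos += 2 + len;
    }
    buf.erase(buf.begin(), buf.begin() + pos);
    return out;
}
```

With 60-200 byte messages, one 30K read feeds this loop with a hundred or more transactions at a cost of a single system call.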

As this points out, you can get multiple processes to handle connections on a single listener in Linux:
> In |node-v0.1.98|, the Yahoo!-contributed file descriptor passing and re-use patches landed in core and allowed the emerging set 
> of HTTP frameworks such as Connect <http://github.com/senchalabs/connect> and multi-node <http://github.com/kriszyp/multi-node> to 
> serve HTTP requests on multiple cores simultaneously with no change in application code or configuration.
> Briefly, the approach used by these frameworks is to create a bound and listening socket in a single process (say, bound to port 80). 
> However, rather than accepting connections using this socket, it is passed off to some number of child processes using 
> |net.Stream.write()| (under the covers this uses |sendmsg(2)| and FDs are delivered using |recvmsg(2)|). Each of these processes 
> in turn inserts the received file descriptor into its event loop and accepts incoming connections as they become available. The OS 
> kernel itself is responsible for load balancing connections across processes.
> It's important to note that this is effectively an L4 load balancer with no affinity; each request by any given client may be 
> served by any of the workers. Any application state that needs to be available to a request cannot simply be kept in-process in a 
> single NodeJS instance.
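The descriptor-passing mechanism the quote refers to is the standard `sendmsg(2)`/`recvmsg(2)` `SCM_RIGHTS` facility. A minimal sketch (the helper names `send_fd`/`recv_fd` are illustrative, not any library's API): a parent sends a listening socket's fd to a child over a Unix-domain channel, and the kernel duplicates the descriptor into the receiving process.

```cpp
#include <sys/socket.h>
#include <cstring>
#include <unistd.h>

// Send one file descriptor over a Unix-domain socket via an SCM_RIGHTS
// control message. Returns 0 on success, -1 on failure.
int send_fd(int chan, int fd_to_pass) {
    char data = 'F';                          // one byte of ordinary payload
    iovec iov{&data, 1};
    char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    cmsghdr* cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;               // this is what carries the fd
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cm), &fd_to_pass, sizeof(int));
    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

// Receive a file descriptor sent by send_fd(); returns the new local fd,
// or -1 on failure. The kernel has duplicated the fd into this process.
int recv_fd(int chan) {
    char data;
    iovec iov{&data, 1};
    char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    if (recvmsg(chan, &msg, 0) != 1) return -1;
    cmsghdr* cm = CMSG_FIRSTHDR(&msg);
    if (!cm || cm->cmsg_type != SCM_RIGHTS) return -1;
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

Each child then adds the received listening fd to its event loop and calls `accept(2)` on it; the kernel spreads incoming connections across whoever is accepting.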

That's cool, but it causes a problem with maintaining state.  A good stateless API style would work perfectly.  Alternatively, the initial 
landing point can forward the browser to a worker URL that either doesn't split the connection handling or uses the proxy method:

> *Using NodeJS to route requests*
> In some cases it may be impossible or undesirable to use either of the above two facilities. For example, one's application may 
> require affinity that cannot be configured using a load balancer (e.g., policy decisions based on complex application logic or the 
> SE Linux security context of the incoming connection). In such cases, one can accept a connection in a single process and 
> interrogate it before handing it off to the correct process for handling.
> The following example requires |node-v0.1.100| or later and node-webworker <http://github.com/pgriess/node-webworker>, a NodeJS 
> implementation of the emerging HTML5 Web Workers <http://www.whatwg.org/specs/web-workers/current-work/> standard for parallel 
> execution of JavaScript. You can install |node-webworker| using npm <http://npmjs.org/> by executing |npm install webworker@stable|.
> While an in-depth explanation of Web Workers is beyond the scope of this article, for our purposes one can think of a worker as an 
> independent execution context (such as a process) that can pass messages back and forth with the JavaScript environment that 
> spawned it. The |node-webworker| implementation supports sending around file descriptors using this message-passing mechanism.

Process or thread per connection was always wrong for something meant to be lightweight or scalable.  Good for SSH, bad for HTTP, 
SMTP, DNS, etc.

Callback and other types of async programming are what I used to call pseudo-threading.  Nearly all GUI paradigms use exactly this model: the GUI has only a single thread.  The application can put work in background threads, whether temporary (via workers) or long-lived, but the GUI-related results must be queued to happen on the UI thread.  This used to be fairly painful, but anonymous inner classes in Java (close enough to closures) always provided a nice way out.

In Android, with AsyncTask or similar, it is possible to write very concise code in an essentially sequential manner that executes in async fashion.  Lambdas in C++11 and blocks in Objective-C/C++ provide this same power.  JavaScript, with first-class function objects, easily provides this, although I haven't seen examples of easy arbitrary nesting yet.

All GUI systems have at least one message / object queue.  Qt has many, with the ability to put non-GUI instances on their own threads yet allow all coding to be essentially sequential, or what you might call segmented sequential.  Qt now supports lambdas, so that should clean code up tremendously.
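The "segmented sequential" pattern can be sketched in a few lines of C++11: background threads do the work, then post a closure onto the single event-loop thread's queue, so every closure runs sequentially with no locking inside application code. `EventLoop` is an illustrative name here, not any particular toolkit's API.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class EventLoop {
public:
    // Callable from any thread: enqueue a closure for the loop thread.
    void post(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push_back(std::move(task)); }
        cv_.notify_one();
    }
    // Run on the "UI" thread; tasks execute one at a time, in FIFO order.
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [&] { return stopped_ || !q_.empty(); });
                if (q_.empty()) return;   // stopped and fully drained
                task = std::move(q_.front());
                q_.pop_front();
            }
            task();                       // sequential: never overlaps another task
        }
    }
    // Stop is itself posted, so tasks queued before it still run.
    void stop() { post([this] { stopped_ = true; }); }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> q_;
    bool stopped_ = false;
};
```

A worker computes off the loop thread, then `post()`s a lambda that touches "UI" state; the state itself needs no locks because only the loop thread ever reads or writes it.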

Some examples of usage and nuances of UI vs. background on Android are in the following link.

> Yes and no. Classic multi-threading architecture (locks, thread pools, shared resources, etc) is definitely on the way out. Node.js is not a good example of this but it captures a bit of the flavor.
> Modern multicore processors are what is driving software *away* from threading. CPUs have become so efficient at retiring instructions that anything which injects modest latency has a huge cost in terms of throughput.

The multicore concurrency control primitives are valid and useful, such as those abstracted well enough in Java and now nicely 
portable at the C++ level with C++11 <thread>, <mutex>, <condition_variable>, <future>, etc.: http://en.cppreference.com/w/cpp
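For the flavor of those C++11 primitives, here is a generic sketch, not tied to any of the systems discussed in this thread: a one-shot result via std::async/std::future, and a mutex-protected counter shared by several std::threads.

```cpp
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// One-shot result: the lambda runs on another thread; future::get() joins
// with it and retrieves the value.
int square_async(int x) {
    std::future<int> f = std::async(std::launch::async, [x] { return x * x; });
    return f.get();
}

// Classic shared-state counting across threads, with the lock held only
// for the increment itself.
int count_to(int n_threads, int per_thread) {
    int count = 0;
    std::mutex mu;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lk(mu);
                ++count;
            }
        });
    for (auto& w : workers) w.join();
    return count;
}
```

These compile anywhere a conforming C++11 toolchain exists, which is exactly the portability win over platform thread APIs.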

However, the model for using these primitives has often been wrong-headed, clearly misunderstanding basic performance issues 
at the logical, computer-architecture, statistical, and scalability levels.  I've seen a number of products, papers, and other 
attempts at turning sequential algorithms into algorithms that try to use all cores for a single loop.  Most threading systems 
create and destroy thousands of threads, constantly churning memory and paying for multiple system calls and other latency per operation.  
Outside of certain specialized applications, such as full-mesh physics sims with no time travel, it is usually the case that each 
core should do independent work with as little synchronization and data sharing as possible.  This essentially means sequential 
coding, or sequential coding inside reasonably large blocks of code.  If all of your logic falls below the "reasonably large" 
threshold, then you shouldn't be dealing with threading in your code, especially if you are in an easy async next-step queuing system.

> Some of the biggest culprits for avoidable latency: context-switching, cache coherency, and non-local memory references. The easiest way to design these out of your code is to have a single thread/process per core and to schedule all operations in userspace using coroutines, callbacks, etc. Any interactions between threads is used sparingly and based on message passing. There may be a shared address space but very little actual sharing going on.

Right.  Message passing should be done with atomic, non-blocking, spinlock, or similar statistically lock-free methods.  Qt has this 
built in for key objects; C++11 makes it easy to add anywhere.  Obviously, you want to optimize the number of threads.  Sometimes 
this is one per core, sometimes two per core, depending on what it takes to get full utilization.
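One common shape for that kind of statistically lock-free message passing is a single-producer / single-consumer ring buffer built on std::atomic loads and stores, with no mutex at all; one such queue per producer/consumer pair gives inter-thread messaging with no blocking on the fast path. This is an illustrative sketch, not a library API.

```cpp
#include <atomic>
#include <cstddef>

// Single-producer / single-consumer ring buffer. N must be a power of two;
// usable capacity is N - 1 (one slot is sacrificed to distinguish full
// from empty).
template <typename T, size_t N>
class SpscQueue {
public:
    bool push(const T& v) {       // call from the producer thread only
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    bool pop(T& out) {            // call from the consumer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    T buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};
```

The acquire/release pairing is what makes the payload write in `push` visible to the matching `pop` without any lock; neither side ever waits on the other, it simply reports full or empty.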

> In practice, the performance benefit to doing things this way compared to the classical multithreaded paradigm is an integer factor. Another benefit is that most code can be written in a simple, single-threaded style since there are no locking considerations, but you also have to design your own schedulers. Cooperative multitasking is back!

GUI systems, and async segmented sequential systems like Node and nginx, handle the scheduling for you for the most part.

I've recently designed and mostly built a general-purpose but specially designed C++ scheduler architecture for multi-core and 
multi-core-type (CPU cores, GPU cores, DSP cores, HW blocks) systems: thread pool, buffer pool, job queuing, parameter management, 
and preallocation with heavy reuse.  It solves a lot of difficult problems elegantly, including making processing-step development very easy, 
while having practically no overhead.  The next job can be queued or selected with normally-zero-time locking, allowing each thread 
worker or worker proxy to run unimpeded until its queue type runs out.

GPU cores use these types of architectures internally.  The trick is to integrate your overall program into a clean architecture.  
Intel and others have partial solutions, but they are clumsy or incomplete in various ways.  Nvidia has a new system for managing 
this on Android between CPU and GPU tasks.  Microsoft's C++ AMP is a little interesting.  OpenCL, which can target CPUs, GPUs, and perhaps 
other types of processors, needs something like this, but doesn't seem to have it yet: you have to choose to target either the CPU 
or the GPU.

Halide is a very interesting attempt to combine scheduling, easy loop / tiling tuning, code generation, etc. to solve this problem 
for image processing.  Very cool, currently restricted in what you can build with it, and the implementation, which has mostly 
reasonable choices, is still a bit of a research hack.

> Andrew

