[FoRK] [info] (highscalability.com) The Secret to 10 Million Concurrent Connections -The Kernel is the Problem, Not the Solution

J. Andrew Rogers andrew at jarbox.org
Tue May 21 08:24:20 PDT 2013

On May 20, 2013, at 4:33 PM, Stephen Williams <sdw at lig.net> wrote:
> Sure, direct drive scatter / gather async request management has been around for Unix on SGI, Sun, etc. forever, used by Sybase & Oracle at least.  The SGI server that ran the database for Buddylist did that (1995), with the somewhat broken side effect on SGI Unix that it had to poll to do it so the CPU was always at 100% for the database threads.

I don't know the details of every UNIX implementation, but my general impression is that these implementations are broken for the purposes of building exokernel-like applications. There are several "standard" ways to do async I/O in Linux, but virtually none of them can be used for exokernel application construction. That said, as of 2013 there is a reasonable way to do most things an exokernel-style app requires on Linux.

It is not sufficient to have async requests for I/O; the methods must meet a number of other criteria. 

> Ideally, context switches are avoided and block traffic is exchanged between app threads and device drivers on dedicated threads with spinlock protected logical queues on heaps of blocks.

That implies a lot more performance-destroying multithreading behavior than you want in a real implementation. Ideally, threads almost never interact, and when they do, it is in a way that pipelines locality. Even so-called lock-free structures are a bottleneck and should be used sparingly.
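
A minimal sketch of the "threads almost never interact" structure, in Python for brevity. All names here (Shard, owner, count_keys) are hypothetical and chosen for illustration. Each worker thread exclusively owns one slice of the key space, so its state needs no lock, and the only interaction is a single batched handoff per worker. Python's queue.Queue takes a lock internally and the interpreter serializes threads, so this shows the ownership structure only and makes no performance claim.

```python
import queue
import threading
import zlib


class Shard(threading.Thread):
    """A worker that exclusively owns its slice of the key space."""

    def __init__(self):
        super().__init__()
        self.inbox = queue.Queue()  # the only point of contact with other threads
        self.counts = {}            # touched only by this thread, so no lock

    def run(self):
        while True:
            batch = self.inbox.get()
            if batch is None:       # sentinel: no more work
                return
            for key in batch:
                self.counts[key] = self.counts.get(key, 0) + 1


def owner(key, n_shards):
    """Map a key to the one shard that owns it (stable across runs)."""
    return zlib.crc32(key.encode()) % n_shards


def count_keys(keys, n_shards=4):
    shards = [Shard() for _ in range(n_shards)]
    for shard in shards:
        shard.start()

    # Partition up front so each shard receives one batch, not one message per key.
    batches = [[] for _ in range(n_shards)]
    for key in keys:
        batches[owner(key, n_shards)].append(key)

    for shard, batch in zip(shards, batches):
        shard.inbox.put(batch)  # a single handoff per shard
        shard.inbox.put(None)

    for shard in shards:
        shard.join()
    # Reading the counts after join() is safe: the owning threads have exited.
    return [shard.counts for shard in shards]
```

Calling count_keys(["a", "b", "a", "c", "a"]) returns one dictionary per shard, and every key appears in exactly one of them. The design choice being illustrated is that routing by ownership replaces shared structures: no two threads ever touch the same counter.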

> If you cared, and weren't using SSD, it wouldn't be that difficult to do the same for storage.  You'd just need a layout and logic that allows effective use of whatever block the drive decided to send you next.

I don't see why it matters that you aren't using SSD; the code is about the same for SSD as for HDD. Also, exokernels let applications take full control of the I/O scheduling (setting aside whatever the storage firmware or driver might do). Being able to impose a partial order on I/O operations that reflects a dynamic dependency graph is where a lot of the value is. Asynchronous does not imply unordered.
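
To make "asynchronous does not imply unordered" concrete, here is a small sketch in Python using asyncio. It is not any particular system's scheduler; OrderedIOScheduler, fake_write, and the operation names are hypothetical, and asyncio.sleep stands in for real device I/O. Every operation is issued asynchronously, but each one waits for the operations it depends on, so the completion order respects the dependency graph while independent operations overlap. The sketch assumes the graph is acyclic; a cycle would simply never complete.

```python
import asyncio


class OrderedIOScheduler:
    """Issue async operations concurrently while honoring a partial order."""

    def __init__(self):
        self._ops = {}       # name -> (coroutine function, names it depends on)
        self.completed = []  # completion order, recorded for inspection

    def add(self, name, op, after=()):
        """Register an operation that may start only after `after` have completed."""
        self._ops[name] = (op, tuple(after))

    async def run(self):
        done = {name: asyncio.Event() for name in self._ops}

        async def issue(name):
            op, deps = self._ops[name]
            for dep in deps:          # block only on this operation's own dependencies
                await done[dep].wait()
            await op()
            self.completed.append(name)
            done[name].set()

        # Everything is submitted at once; the events enforce the partial order.
        await asyncio.gather(*(issue(name) for name in self._ops))


async def fake_write(delay):
    """Stand-in for a real asynchronous write."""
    await asyncio.sleep(delay)


def demo():
    sched = OrderedIOScheduler()
    sched.add("log", lambda: fake_write(0.02))
    sched.add("page_a", lambda: fake_write(0.01), after=["log"])
    sched.add("page_b", lambda: fake_write(0.005), after=["log"])
    sched.add("checkpoint", lambda: fake_write(0.0), after=["page_a", "page_b"])
    asyncio.run(sched.run())
    return sched.completed
```

In demo(), "log" always completes first and "checkpoint" always completes last. The two page writes are unordered relative to each other, which is the point: only the dependencies that matter are imposed.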

> Networking is also theoretically not difficult; I just don't want to work on that kind of plumbing when I have much shinier and novel items on my queue.

Networking is the least important part; you can save it for last. The networking APIs in most operating systems are much better than those for other forms of I/O in terms of letting you do something sensible. Also, networking often has the poorest support for exokernel-like construction, so there is a double incentive to ignore it for as long as possible.

Exokernel application skeletons are really neat pieces of elegant software with very dense functional complexity. There is nothing quite like them in software. Each one is basically an OS kernel with none of the device driver stuff to worry about. (I've designed exokernel applications; we just didn't use the term "exokernel".)
