[FoRK] [info] (highscalability.com) The Secret to 10 Million Concurrent Connections -The Kernel is the Problem, Not the Solution

Stephen Williams sdw at lig.net
Tue May 21 08:55:13 PDT 2013


On 5/21/13 8:24 AM, J. Andrew Rogers wrote:
> On May 20, 2013, at 4:33 PM, Stephen Williams <sdw at lig.net> wrote:
>> Sure, direct drive scatter / gather async request management has been around for Unix on SGI, Sun, etc. forever, used by Sybase & Oracle at least.  The SGI server that ran the database for Buddylist did that (1995), with the somewhat broken side effect on SGI Unix that it had to poll to do it so the CPU was always at 100% for the database threads.
>
> I don't know the details of every UNIX implementation but my general impression was that these implementations are broken for the purposes of building exokernel-like applications. There are several "standard" ways to do async I/O in Linux but virtually all of them cannot be used for exokernel application construction. In 2013, there is a reasonable way to do most things required by an exokernel-style app on Linux.
>
> It is not sufficient to have async requests for I/O; the methods must meet a number of other criteria.
>
>
>> Ideally, context switches are avoided and block traffic is exchanged between app threads and device drivers on dedicated threads with spinlock protected logical queues on heaps of blocks.
>
> That implies a lot more performance-destroying multithreading behavior than you want in a real implementation. Ideally, threads almost never interact and when they do it is in a way that pipelines locality. So-called lock-free structures are a bottleneck and should be used sparingly.

I was talking here about the synchronization between an application thread and a device driver.  Both need to read and write, 
ideally without context switches and with minimal interrupts / blocking.

>
>
>> If you cared, and weren't using SSD, it wouldn't be that difficult to do the same for storage.  You'd just need a layout and logic that allows effective use of whatever block the drive decided to send you next.
>
> I don't get why not using SSD is important; the code is about the same as with an HDD. Also, exokernels let applications take full control of the I/O scheduling (ignoring what the storage firmware/driver might do). Being able to impose a partial order on I/O operations that reflects a dynamic dependency graph is where a lot of the value is. Asynchronous does not imply unordered.

This is an example of something that mattered more in the past: scatter / gather on SSD isn't going to be as big an improvement 
as it can be for SCSI (and presumably SATA, which is essentially SCSI).  The SSD has no rotational latency or heads to move, so 
it can probably return responses in the same order as the requests were given.
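
To illustrate the "layout and logic that allows effective use of whatever block the drive decided to send you next" idea, here 
is a minimal sketch in Python.  It is not how any real driver or database does it; the BlockAssembler class and its complete() 
method are names made up for this example.  It assumes fixed-size blocks and that each completion carries its block index, so 
blocks can be placed as they arrive, in any order.

```python
class BlockAssembler:
    """Hypothetical helper: place fixed-size blocks as they complete, in any order."""

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        self.data = bytearray(block_size * block_count)
        # Indices of blocks that have been requested but not yet delivered.
        self.pending = set(range(block_count))

    def complete(self, index, payload):
        """Record one completed block; return True once every block has arrived."""
        if index not in self.pending:
            raise ValueError("unexpected or duplicate block %d" % index)
        if len(payload) != self.block_size:
            raise ValueError("payload does not match the block size")
        start = index * self.block_size
        self.data[start:start + self.block_size] = payload
        self.pending.discard(index)
        return not self.pending


# Completions arrive in whatever order the drive chose (here: 2, 0, 1).
asm = BlockAssembler(block_size=2, block_count=3)
for index, payload in [(2, b"ee"), (0, b"aa"), (1, b"cc")]:
    done = asm.complete(index, payload)
```

With a rotating disk the arrival order depends on head position; with an SSD the same code still works, it just sees the 
blocks arrive roughly in request order.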

>
>
>> Networking is also theoretically not difficult; I just don't want to work on that kind of plumbing when I have much shinier and novel items on my queue.
>
> Networking is the least important part; you can save it for last. The APIs in most operating systems are much better than for other forms of I/O in terms of letting you do something sensible. Also, for exokernel-like construction it often has the poorest support, so a double incentive to ignore it for as long as possible.

Which is why people have saved it for last.  But some applications don't need much beyond memory and networking, or they can 
defer the storage part to another system efficiently.
A basic problem with much of networking is the requirement for system calls and context switches per packet, or at least per 
connection per I/O.  Ideally, you get an async circular buffer of shared memory for input and output.  Then you can process 
many packets with infrequent or no system calls.  That's what I want anyway.  You can easily do this with communications 
concentrators, turning connection events into messages.  But using exokernel methods, this could all be done in a single box 
efficiently.
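
A rough sketch of the circular buffer idea, in Python only to show the logic.  The PacketRing class and its push() and drain() 
methods are hypothetical names for this example.  A real version would sit in memory shared between the driver and the 
application and would need memory barriers around the index updates; this sketch assumes a single producer and a single 
consumer in one process.

```python
class PacketRing:
    """Hypothetical fixed-size single-producer / single-consumer ring."""

    def __init__(self, slots):
        # One slot is kept empty so that head == tail always means "empty".
        self.buf = [None] * (slots + 1)
        self.head = 0  # next slot to read; advanced only by the consumer
        self.tail = 0  # next slot to write; advanced only by the producer

    def push(self, packet):
        """Producer side: enqueue one packet, or return False if the ring is full."""
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False
        self.buf[self.tail] = packet
        self.tail = nxt
        return True

    def drain(self, limit):
        """Consumer side: take up to `limit` packets in one pass."""
        batch = []
        while self.head != self.tail and len(batch) < limit:
            batch.append(self.buf[self.head])
            self.buf[self.head] = None
            self.head = (self.head + 1) % len(self.buf)
        return batch


ring = PacketRing(slots=4)
accepted = [ring.push("pkt%d" % i) for i in range(5)]  # fifth push finds the ring full
first_batch = ring.drain(limit=3)
```

The point is the drain() call: the consumer picks up a whole batch per pass by comparing two indices, rather than paying a 
system call and context switch for each packet.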

>
> Exokernel application skeletons are really neat pieces of elegant software with very dense functional complexity. Nothing quite like it in software. It is basically an OS kernel with none of the device driver stuff to worry about. (I've designed exokernel applications, we just didn't use the term "exokernel".)

Running as an app under another kernel that has virtualized device drivers?

sdw


