[FoRK] Software hacks using timestamp counters

Stephen Williams sdw at lig.net
Mon Oct 1 11:29:16 PDT 2012

On 10/1/12 10:31 AM, J. Andrew Rogers wrote:
> On Sep 30, 2012, at 11:53 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
>> For any system where TSC is useful, I'd be surprised to find that L2 coherency is so expensive that a single cache line retrieval is too expensive.  That is especially true if you factor in the ability to use prefetch to hide that latency.
> Cache coherency latencies can be well beyond the cost of a TSC access and a TSC access is embarrassingly parallel -- it doesn't interact with the state of other cores. Prefetching is a micro-optimization that won't usefully apply much of the time. If a little prefetching actually worked then massively parallel barrel processors wouldn't have been invented, and even barrel processors are susceptible to memory stalls.
>> If your software is designed so that every core is isolated from every other core, why is using TSC to make task decisions useful?
> Threading does not enter into it at all; if there are thread interactions then you've lost the plot and TSC won't add any value. The CPU can be stolen from a single process in multiple circumstances and I/O tends to be interrupt-driven in any case -- servers usually care about I/O. There are also some more complicated side effects if you are taking advantage of hyper-threads. The TSC lets you keep track of this stolen or lost CPU time.
> I am talking about many concurrent logical processes implemented as a single physical process/thread locked to a core. This largely eliminates a major waste of CPU in threaded, shared memory models: cache coherency, non-local memory access, context switching, and locks/fences. In practice, this leads to a substantial performance boost on modern CPUs.
> Because you only have one physical thread on a core that does not work with any other threads, everything the thread does needs to be non-blocking. In short, pervasive polling behavior. This leads to another source of waste: lots of EAGAIN, EWOULDBLOCK, and expensive syscalls that only do a tiny amount of work. If you understand the latency structure of your code and hardware environment and start measuring actual latencies, you can estimate the probability of success or failure of non-blocking syscalls, the amount of work that will be available, etc and adjust your schedule accordingly.
> For trivial event schedules, such as one that only is non-blocking for simple network I/O, this doesn't matter. When you have a complicated set of resources -- multiple classes of network, multiple classes of disk, paging, etc all being managed by a single non-blocking process -- intelligently scheduled polling starts to add real value. (As to why you would want to schedule all of these things manually: performance.)

Ahh.  In fact I experienced exactly that kind of effect, although not in 
as hard-core a way, with Buddylist at AOL in 1996-1997.  Because 
transactions were concentrated / batched for front-end I/O and back-end 
database operations, the server process didn't fall apart once it hit 
100% CPU; it actually became more efficient, because it was doing more 
work per I/O operation on average.
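A back-of-the-envelope model of that effect, with illustrative numbers 
(the overheads below are assumptions, not AOL measurements):

```python
# Toy model of why batching makes a saturated server *more* efficient:
# each I/O syscall has a fixed overhead, so the more requests that
# accumulate per syscall, the lower the average per-request cost.
# The cost constants are illustrative assumptions, not measurements.

SYSCALL_OVERHEAD_US = 5.0   # fixed cost per read/write syscall (assumed)
PER_REQUEST_US = 1.0        # marginal processing cost per request (assumed)

def cost_per_request(batch_size: int) -> float:
    """Average microseconds of CPU spent per request at a given batch size."""
    return SYSCALL_OVERHEAD_US / batch_size + PER_REQUEST_US

for n in (1, 4, 16, 64):
    print(f"batch={n:3d}  cost/request={cost_per_request(n):.2f} us")
```

Under load, more requests accumulate between syscalls, so batch size 
rises and per-request cost falls -- the "more efficient at 100% CPU" 
behavior above.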

Definitely, in an I/O-bound situation, you could estimate when buffers 
have had time to accumulate significant data before making the system 
calls to retrieve what's there.  Avoiding unnecessary system calls is a 
huge win.  You could gather real-time statistics to choose an optimal 
wait for each I/O channel type on the fly.  This could be very useful in 
any situation where you have too much computation to do, a "do as much 
as you can" type of workload, or competing work.
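A minimal sketch of that per-channel statistics idea: track an 
exponentially weighted moving average of each channel's arrival rate and 
only poll once enough bytes should have accumulated.  The class name and 
the 0.2 smoothing factor are illustrative assumptions, not anything from 
a real server.

```python
# Sketch: estimate when a buffer has accumulated enough data to be
# worth a syscall, using an EWMA of the channel's observed arrival rate.
# All names and constants here are hypothetical.

class ChannelStats:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha          # EWMA smoothing factor (assumed)
        self.rate = 0.0             # estimated bytes per millisecond

    def observe(self, bytes_read: int, elapsed_ms: float) -> None:
        """Update the rate estimate after a completed read."""
        sample = bytes_read / elapsed_ms
        self.rate = self.alpha * sample + (1 - self.alpha) * self.rate

    def wait_for(self, target_bytes: int) -> float:
        """Milliseconds to wait until target_bytes should be available."""
        if self.rate <= 0:
            return 0.0              # no estimate yet; poll immediately
        return target_bytes / self.rate

stats = ChannelStats()
stats.observe(4096, 10.0)           # 4 KiB arrived over 10 ms
print(stats.wait_for(8192))         # suggested delay before the next read
```

The scheduler would then sort channels by their suggested wait and visit 
the ones most likely to have a full buffer, skipping syscalls that would 
probably return EAGAIN.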

It seems that the right way to solve this problem might be to create a 
communications-concentrator interface to the kernel for handling many 
TCP connections, many block-device scatter/gather requests, etc.  You 
could even use the oft-imagined ACE (or whatever the Intel et al 
system-call bypass system was called) mechanism of avoiding system calls 
and just filling in shared driver/process buffers with spinlocked flags.
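The shared-buffer handoff can be sketched as a single-producer, 
single-consumer ring of slots, each guarded by a "full" flag that the 
other side spins on -- no syscall in the data path.  This is a toy model 
only: threads stand in for a driver and a process, and plain Python 
lists stand in for shared-memory pages with atomic flags.

```python
# Minimal sketch of syscall-free handoff via a shared ring buffer.
# Each slot has a flag written by exactly one side: the producer sets
# it to publish data, the consumer clears it to release the slot.
import threading

N_SLOTS = 8
slots = [None] * N_SLOTS
full = [False] * N_SLOTS        # per-slot "data ready" flag

def produce(items):
    i = 0
    for item in items:
        while full[i]:          # spin until the consumer drains the slot
            pass
        slots[i] = item
        full[i] = True          # publish: slot is now ready
        i = (i + 1) % N_SLOTS

def consume(n, out):
    i = 0
    for _ in range(n):
        while not full[i]:      # spin until the producer fills the slot
            pass
        out.append(slots[i])
        full[i] = False         # release the slot back to the producer
        i = (i + 1) % N_SLOTS

out = []
t = threading.Thread(target=consume, args=(20, out))
t.start()
produce(range(20))
t.join()
print(out == list(range(20)))   # items arrive in order, no syscalls
```

In a real bypass design the flags would need proper atomics and memory 
barriers; CPython's interpreter lock papers over that here.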
>> Supercomputer-like systems with very fast multidimensional forwarding interconnects can have very low latency between hundreds or thousands of processors.  This is true of the Cray XT series, as well as several 128+ processor ARM-based systems.  In those, the overhead of remote communication and transfer can be kept low enough to allow synchronized job handling of the right granularity, especially if any type of prefetch and pipelining is used.
> You've been huffing their marketing literature. You scale codes on these architectures the same way you do on a big cluster: treating every single core as an isolated network-attached process. If you don't, it won't scale.

The point of those architectures is that some types of problems and 
algorithms need, or seem to need, globally synchronous communication and 
stepping.  Physics simulations of many kinds are the classic example.  
While there have been experiments trying to avoid this some of the time, 
as far as I know it is still standard.  Those types of algorithms won't 
run well on hardware with large latencies between any two nodes in a 
cluster.  The programming structure often uses something like MPI, which 
has independent processes communicating with messages, but the 
communication fabric is required to be any<->any with low latency and 
high bandwidth.
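The synchronous-stepping pattern can be illustrated with a 1-D stencil: 
every cell needs its neighbors' values before any cell may advance a 
step.  Here the "ranks" are simulated in one process; in real MPI code 
the neighbor reads would be Sendrecv exchanges, and the lock-step 
requirement is exactly why every link's latency matters.

```python
# Toy illustration of globally synchronous stepping: a 1-D stencil
# (a diffusion-like averaging) where each step depends on neighbor
# data from the previous step, so no cell can run ahead of the others.

def step(cells):
    """One synchronous update: each cell averages with its two neighbors."""
    n = len(cells)
    return [
        (cells[(i - 1) % n] + cells[i] + cells[(i + 1) % n]) / 3.0
        for i in range(n)
    ]

state = [0.0] * 8
state[0] = 8.0                   # a spike that diffuses outward
for _ in range(4):
    state = step(state)          # implicit barrier between steps
print(sum(state))                # the averaging conserves the total
```

On a cluster, each of those barriers costs at least the worst-case 
round-trip between any pair of ranks, which is why these codes demand a 
uniformly low-latency fabric.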

Far more algorithms are embarrassingly parallel in some way, hence 
Hadoop et al.  For those, you usually wouldn't want specialized hardware 
unless it was cheap.  A number of algorithms are somewhere in the 
middle, often with at least a little cost for distributed concurrency.
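The embarrassingly parallel shape is easy to see in a minimal map/reduce 
word count: each chunk is processed with no communication at all, and 
only the final combine touches shared state.  This is a generic sketch 
of the pattern, not Hadoop's actual API.

```python
# Minimal map/reduce sketch of the embarrassingly parallel case:
# per-chunk work is fully independent, so chunks could run on any
# number of machines; only the cheap reduce step merges results.
from collections import Counter
from functools import reduce

def map_chunk(chunk: str) -> Counter:
    """Independent per-chunk work: count words locally."""
    return Counter(chunk.split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Combine step: merge partial counts."""
    return a + b

chunks = ["the quick fox", "the lazy dog", "the fox"]
total = reduce(reduce_counts, map(map_chunk, chunks))
print(total["the"])  # 3
```

Because the map calls never communicate, this scales on commodity 
hardware with no special interconnect -- the opposite of the synchronous 
stencil case.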
> I have done massively parallel algorithm design on all manner of exotic platforms (it is my mediocre superpower -- parallelize anything on any silicon) like you are envisioning above, including ones with far more sophisticated latency hiding than what you are dreaming up, and it solves nothing. The fundamental model -- a network of individual cores with local RAM -- is the same for every architecture if you need massive parallelism. In fact, the most difficult platforms to parallelize are those that intentionally obscure this model (usually some misguided "ease of use" notion). This technology has been around a long time; all the good parts are already in your desktop.
> You can't brute-force your way around transaction theoretic limits with magical hardware. Disjoint-parallel algorithms are the only scalable solution.

Vectorized SIMD hardware can be an exception for applicable problems.  
The ability to do 8 - 128 operations at once can make a huge difference.
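The lane model behind that factor can be sketched in pure Python: one 
vector instruction applies the same operation across a whole group of 
lanes at once.  The 8-lane width below is just one example (e.g. 32-bit 
integers in a 256-bit register); real SIMD does each lane group in a 
single instruction rather than a loop.

```python
# Conceptual sketch of SIMD width: the same operation is applied
# across WIDTH lanes "at once". Real hardware (SSE/AVX/NEON) performs
# each lane group as one instruction, which is where the 8x-128x
# speedup over scalar code comes from.

WIDTH = 8                        # lanes per vector register (example)

def simd_add(a, b):
    """Add two equal-length sequences WIDTH lanes at a time."""
    out = []
    for i in range(0, len(a), WIDTH):
        # one "vector instruction": element-wise add over a lane group
        out.extend(x + y for x, y in zip(a[i:i + WIDTH], b[i:i + WIDTH]))
    return out

print(simd_add(list(range(16)), [1] * 16))
```

The catch, of course, is that the problem has to present long runs of 
identical, branch-free operations for the lanes to stay full.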

