[FoRK] Software hacks using timestamp counters
sdw at lig.net
Mon Oct 1 11:29:16 PDT 2012
On 10/1/12 10:31 AM, J. Andrew Rogers wrote:
> On Sep 30, 2012, at 11:53 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
>> For any system where TSC is useful, I'd be surprised to find that L2 coherency is so expensive that a single cache line retrieval is too expensive. That is especially true if you factor in the ability to use prefetch to hide that latency.
> Cache coherency latencies can be well beyond the cost of a TSC access and a TSC access is embarrassingly parallel -- it doesn't interact with the state of other cores. Prefetching is a micro-optimization that won't usefully apply much of the time. If a little prefetching actually worked then massively parallel barrel processors wouldn't have been invented, and even barrel processors are susceptible to memory stalls.
>> If your software is designed so that every core is isolated from every other core, why is using TSC to make task decisions useful?
> Threading does not enter into it at all; if there are thread interactions then you've lost the plot and TSC won't add any value. The CPU can be stolen from a single process in multiple circumstances and I/O tends to be interrupt-driven in any case -- servers usually care about I/O. There are also some more complicated side effects if you are taking advantage of hyper-threads. The TSC lets you keep track of this stolen or lost CPU time.
> I am talking about many concurrent logical processes implemented as a single physical process/thread locked to a core. This largely eliminates a major waste of CPU in threaded, shared memory models: cache coherency, non-local memory access, context switching, and locks/fences. In practice, this leads to a substantial performance boost on modern CPUs.
> Because you only have one physical thread on a core that does not work with any other threads, everything the thread does needs to be non-blocking. In short, pervasive polling behavior. This leads to another source of waste: lots of EAGAIN, EWOULDBLOCK, and expensive syscalls that only do a tiny amount of work. If you understand the latency structure of your code and hardware environment and start measuring actual latencies, you can estimate the probability of success or failure of non-blocking syscalls, the amount of work that will be available, etc and adjust your schedule accordingly.
> For trivial event schedules, such as one that only is non-blocking for simple network I/O, this doesn't matter. When you have a complicated set of resources -- multiple classes of network, multiple classes of disk, paging, etc all being managed by a single non-blocking process -- intelligently scheduled polling starts to add real value. (As to why you would want to schedule all of these things manually: performance.)
Ahh. In fact I experienced exactly that kind of effect, although not in
as hard-core a way, with Buddylist at AOL in 1996-1997. Because
transactions were concentrated / batched for front-end I/O and back-end
database operations, the server process didn't fall apart once it hit
100% CPU; it actually became more efficient, because it was doing more
work per I/O operation on average.
Definitely, in an I/O bound situation, you could estimate when buffers
have had time to accumulate significant data before doing system calls
to retrieve what's there. Avoiding unnecessary system calls is a huge
win. You could develop real-time statistics to choose an optimal wait
for each I/O channel type on the fly. In any situation where you have
too much computation to do, perhaps a "do as much as you can" situation
or one where work is competing for the CPU, this could be very useful.
It seems that the right way to solve this problem might be to create a
communications-concentrator interface to the kernel for handling many
TCP connections, many block-device scatter/gather requests, etc. You
could even use the oft-imagined ACE (or whatever the Intel et al.
system-call-bypass mechanism was called) approach of avoiding system
calls and just filling in shared driver/process buffers guarded by
spinlocked flags.
>> Supercomputer-like systems with very fast multidimensional forwarding interconnects can have very low latency between hundreds or thousands of processors. This is true of the Cray XT series, as well as several 128+ processor ARM-based systems. In those, the overhead of remote communication and transfer can be kept low enough to allow synchronized job handling of the right granularity, especially if any type of prefetch and pipelining are used.
> You've been huffing their marketing literature. You scale codes on these architectures the same way you do on a big cluster: treating every single core as an isolated network-attached process. If you don't, it won't scale.
The point of those architectures is that some types of problems and
algorithms need, or seem to need, globally synchronous communication and
stepping. Physics simulations of many kinds are the classic example.
While there have been experiments trying to avoid this some of the time,
as far as I know it is still standard. Those types of algorithms won't
run well on hardware with large latencies between any two nodes in a
cluster. The programming structure often uses something like MPI, with
independent processes communicating via messages, but the communication
fabric is required to be any<->any with low latency and high bandwidth.
Far more algorithms are embarrassingly parallel in some ways, hence
Hadoop et al. For those, you usually wouldn't want specialized hardware
unless it was cheap. A number of algorithms are somewhere in the middle,
often with at least a little cost for distributed concurrency.
> I have done massively parallel algorithm design on all manner of exotic platforms (it is my mediocre superpower -- parallelize anything on any silicon) like you are envisioning above, including ones with far more sophisticated latency hiding than what you are dreaming up, and it solves nothing. The fundamental model -- a network of individual cores with local RAM -- is the same for every architecture if you need massive parallelism. In fact, the most difficult platforms to parallelize are those that intentionally obscure this model (usually some misguided "ease of use" notion). This technology has been around a long time; all the good parts are already in your desktop.
> You can't brute-force your way around transaction theoretic limits with magical hardware. Disjoint-parallel algorithms are the only scalable solution.
Vectorized SIMD hardware can be an exception for applicable problems.
The ability to do 8-128 operations at once can make a huge difference.