[FoRK] Software hacks using timestamp counters
J. Andrew Rogers
andrew at jarbox.org
Mon Oct 1 10:31:53 PDT 2012
On Sep 30, 2012, at 11:53 PM, "Stephen D. Williams" <sdw at lig.net> wrote:
> For any system where TSC is useful, I'd be surprised to find that L2 coherency is so expensive that a single cache line retrieval is too expensive. That is especially true if you factor in the ability to use prefetch to hide that latency.
Cache coherency latencies can be well beyond the cost of a TSC access and a TSC access is embarrassingly parallel -- it doesn't interact with the state of other cores. Prefetching is a micro-optimization that won't usefully apply much of the time. If a little prefetching actually worked then massively parallel barrel processors wouldn't have been invented, and even barrel processors are susceptible to memory stalls.
> If your software is designed so that every core is isolated from every other core, why is using TSC to make task decisions useful?
Threading does not enter into it at all; if there are thread interactions then you've lost the plot and TSC won't add any value. The CPU can be stolen from a single process in multiple circumstances and I/O tends to be interrupt-driven in any case -- servers usually care about I/O. There are also some more complicated side effects if you are taking advantage of hyper-threads. The TSC lets you keep track of this stolen or lost CPU time.
I am talking about many concurrent logical processes implemented as a single physical process/thread locked to a core. This largely eliminates a major waste of CPU in threaded, shared memory models: cache coherency, non-local memory access, context switching, and locks/fences. In practice, this leads to a substantial performance boost on modern CPUs.
Because you only have one physical thread on a core that does not work with any other threads, everything the thread does needs to be non-blocking. In short, pervasive polling behavior. This leads to another source of waste: lots of EAGAIN, EWOULDBLOCK, and expensive syscalls that only do a tiny amount of work. If you understand the latency structure of your code and hardware environment and start measuring actual latencies, you can estimate the probability of success or failure of non-blocking syscalls, the amount of work that will be available, etc and adjust your schedule accordingly.
For trivial event schedules, such as one that only is non-blocking for simple network I/O, this doesn't matter. When you have a complicated set of resources -- multiple classes of network, multiple classes of disk, paging, etc all being managed by a single non-blocking process -- intelligently scheduled polling starts to add real value. (As to why you would want to schedule all of these things manually: performance.)
> Supercomputer-like systems with very fast multidimensional forwarding interconnects can have very low latency between hundreds or thousands of processors. This is true of the Cray XT series, as well as several 128+ processor ARM-based systems. In those, overhead of remote communication and transfer can be kept low enough to allow enough to allow synchronized job handling of the right granularity, especially if any type of prefetch and pipelining are used.
You've been huffing their marketing literature. You scale codes on these architectures the same way you do on a big cluster: treating every single core as an isolated network-attached process. If you don't, it won't scale.
I have done massively parallel algorithm design on all manner of exotic platforms (it is my mediocre superpower -- parallelize anything on any silicon) like you are envisioning above, including ones with far more sophisticated latency hiding than what you are dreaming up, and it solves nothing. The fundamental model -- a network of individual cores with local RAM -- is the same for every architecture if you need massive parallelism. In fact, the most difficult platforms to parallelize are those that intentionally obscure this model (usually some misguided "ease of use" notion). This technology has been around a long time; all the good parts are already in your desktop.
You can't brute-force your way around transaction theoretic limits with magical hardware. Disjoint-parallel algorithms are the only scalable solution.
More information about the FoRK