[FoRK] parallelism Re: Welcome to the era of radical innovation
J. Andrew Rogers
andrew at jarbox.org
Sun Feb 9 22:19:47 PST 2014
On Jan 13, 2014, at 2:29 PM, Dr. Ernest Prabhakar <drernie at radicalcentrism.org> wrote:
> I *almost* understood you. Could you go into a bit more detail about the difference between latency-aware and schedule-aware?
Parallel algorithm design focuses on minimizing two things: an idle core when useful work can be done (e.g. blocking) and a busy core that is not doing any useful work (e.g. context switching). If you cut away these two things effectively then efficient parallelism, to the extent it is expressible, is what remains. Distributed coordination protocols tend to include a lot of both. This is a different school than most of what passes for parallel algorithm design but it is the only school that works when the parallelism is massive.
Runtime coordination is useful because it is universal. It can handle the case where every process has little information about the nature and behavior of other concurrent processes. However, in real systems we often have significant information about the runtime environment. This information can be cleverly embedded in the design of the algorithm itself such that we can guarantee that concurrent processes rarely or never conflict even though they do not coordinate at runtime.
Two types of environment information that can be leveraged in this way are latency and process schedule.
Latency-aware algorithms are basically clockwork. If you know timing and topology of the parallel execution environment, you can construct an algorithm where each process only touches a shared resource during a window when we can guarantee no other process is touching that resource, and that guarantee is extremely inexpensive. In practice, this is a micro-optimization that typically buys you a single digit integer factor improvement on commodity silicon. Even on a single core, you can get 2x throughput improvement this way by exploiting the parallelism internal to the core (compilers are currently terrible at exploiting this). In a big enough system latency is too difficult to predict for this algorithm design technique to work.
Schedule-aware algorithms “know” the scheduling behavior of the subset of nodes they interact with. This technique works at extreme scales but requires much higher skill. Analysis of such algorithms are considerably more complicated than latency-aware algorithms because they are Nash problems. Basically, if you know the operation scheduling behavior of every node you interact with, you can schedule your own behavior to avoid conflicts given sufficient resources available to the scheduler. The maximum resources required to avoid pathological conflict is a function of the number of nodes you interact with and the scheduling algorithms of other nodes. In practice, the worst-case resource requirements are computable and completely reasonable in commodity computing environments.
I’ve been doing this a long time. The latency-aware stuff is pretty obvious; you can teach it quickly, though it has limited utility. The schedule-aware stuff is really, really non-obvious and most people never grok it; many things about it violate intuition. However, schedule-aware algorithms scale like nothing else.
> And where do tools like Halide fit into your taxonomy?
Halide is cool. I am not an expert on it but as far as I can tell it is a classical but excellent parallel code generator for its domain. It does not attempt to solve the design problems I mention above, at least not directly.
More information about the FoRK