[FoRK] Software hacks using timestamp counters

Stephen Williams sdw at lig.net
Mon Oct 1 09:23:52 PDT 2012

On 10/1/12 8:42 AM, Eugen Leitl wrote:
> On Mon, Oct 01, 2012 at 08:26:03AM -0700, Stephen Williams wrote:
>> For performance, memory movement should usually be carefully
>> orchestrated.  The application, or better the framework through
>> declarative relationships, should manage which tasks follow the memory to
>> minimize bandwidth and latency.  If two cores share L2, that should be
>> taken advantage of.
> I'm expecting that once the new paradigm has become accepted
> and we're in kilonode and meganode country there won't be
> shared memory, or shared cache, for that matter. Arguably,
> there is not much need for cache if your memory is sitting
> within the node, or at least on top of the node die.
> So there will be fewer multicore nodes, rather a single core/node,
> but a very large number of nodes. The overhead for sending
> messages to adjacent dies will be like cache hits, and worst
> case will be like cache misses.

I don't think we're going back to a single core per node in any real case,
unless we are actually sprinkling cores within memory and melding them with
the management of memory bus transactions.  We'll likely always have some
degree of SMP attached to each memory or memory channel.  The memory
bandwidth is large, the power overhead of the memory is significant, and
multiple cores are now cheap and efficient.  Compute modules will definitely
be federated in a fabric, so everything you say is true; it's just that at
the bottom there are still SMP problems and opportunities.

>> What I'm implementing is somewhat MPI-like: A single fixed thread for
>> each CPU core that interacts with other cores by updating a buffer and
>> queuing jobs, some of which get picked up by other cores.
> Do you think developers can accept a subset of MPI as their
> core communication paradigm, or is that too low level?

I'm solving our specific problem in a way that will apply to a larger class
of problems, but not necessarily everything.  My solution is more abstract
than MPI and at a higher granularity than OpenMP or AMX.  I'm solving things
somewhat similarly to Intel TBB, although with much simpler code.
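
To make the "one fixed thread per core, queue jobs, let other cores pick
some up" idea concrete, here is a minimal sketch in C++.  It is not the
actual framework: CorePool, Worker, submit, run and demo are hypothetical
names made up for illustration, the threads are not pinned to physical
cores (that needs a platform-specific call), and plain mutex-protected
deques stand in for whatever queueing a real implementation would use.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: one worker (thread + job queue) per "core".
struct Worker {
    std::mutex lock;
    std::deque<std::function<void()>> jobs;  // jobs queued for this core
};

class CorePool {
public:
    explicit CorePool(std::size_t n) : workers_(n) {}

    // Queue a job on a chosen core, e.g. the one whose cache holds the data.
    void submit(std::size_t core, std::function<void()> job) {
        assert(core < workers_.size());
        std::lock_guard<std::mutex> g(workers_[core].lock);
        workers_[core].jobs.push_back(std::move(job));
        pending_.fetch_add(1);  // counted while the queue lock is still held
    }

    // Start one fixed thread per worker; return when every job has finished.
    void run() {
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < workers_.size(); ++i)
            threads.emplace_back([this, i] { loop(i); });
        for (auto& t : threads) t.join();
    }

private:
    // Take from the front of our own queue, or steal from the back of another.
    bool take(std::size_t from, bool steal, std::function<void()>& out) {
        Worker& w = workers_[from];
        std::lock_guard<std::mutex> g(w.lock);
        if (w.jobs.empty()) return false;
        if (steal) { out = std::move(w.jobs.back()); w.jobs.pop_back(); }
        else       { out = std::move(w.jobs.front()); w.jobs.pop_front(); }
        return true;
    }

    void loop(std::size_t self) {
        std::function<void()> job;
        while (pending_.load() > 0) {
            bool got = take(self, false, job);  // prefer local work
            for (std::size_t k = 1; !got && k < workers_.size(); ++k)
                got = take((self + k) % workers_.size(), true, job);
            if (!got) { std::this_thread::yield(); continue; }
            job();
            pending_.fetch_sub(1);  // decremented only after the job is done
        }
    }

    std::vector<Worker> workers_;
    std::atomic<std::size_t> pending_{0};
};

// Demo: every job fills its own slot of a shared buffer.  All jobs are
// queued on core 0, so the other workers only get work by stealing it.
std::vector<int> demo(std::size_t cores, std::size_t n_jobs) {
    std::vector<int> buffer(n_jobs, 0);
    CorePool pool(cores);
    for (std::size_t j = 0; j < n_jobs; ++j)
        pool.submit(0, [&buffer, j] { buffer[j] = static_cast<int>(j * j); });
    pool.run();
    return buffer;
}
```

Calling demo(4, 64) returns a 64-entry buffer in which slot j holds j*j,
whichever worker ended up running each job.  Taking local work from the
front and stealing from the back is one common choice to keep a core on
the data it queued most recently; a real version would likely also group
queues by shared L2 so that stealing stays within a cache domain first.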

We need buffer, L2, core, device (CPU/GPU/DSP), and parameter management and
optimization while chaining a number of compute-intensive modules on large
amounts of mostly use-once data in, for some modes, a highly repetitive
environment.  All of this while keeping the functional code clean and highly
reconfigurable at compile time or runtime, with several alternate versions.

