[FoRK] [info] arstechnica: more on TILE64

Eugen Leitl <eugen at leitl.org> on Tue Aug 21 05:47:03 PDT 2007

----- Forwarded message from Alejandro Dubrovsky <alito at organicrobot.com> -----

From: Alejandro Dubrovsky <alito at organicrobot.com>
Date: Tue, 21 Aug 2007 21:56:46 +1000
To: info <info at postbiota.org>
Subject: [info] arstechnica: more on TILE64
X-Mailer: Evolution 2.10.2 


MIT startup raises multicore bar with new 64-core CPU

By Jon Stokes | Published: August 20, 2007 - 12:41PM CT

A "sea change in the computing industry"

A new startup out of MIT emerged from stealth mode today to announce
that they're shipping a 64-core processor for the embedded market. The
company, called Tilera, was founded by Dr. Anant Agarwal, the MIT
professor behind the famous and venerable Raw project on which Tilera's
first product, the TILE64 processor, is based. Tilera's director of
marketing, Bob Dowd, told Ars that TILE64 represents a "sea change in
the computing industry," and the company's CEO isn't shy about pitching
the chip as the "first significant new chip architectural development in
a decade." So let's take an initial look at what was announced about
TILE64 today, with further information to follow as it becomes available.

Tell me if this sounds familiar: a grid of processor "tiles" arranged in
a mesh network, where each tile houses a general purpose processor,
cache, and a non-blocking router that the tile uses to communicate with
the other tiles on the chip. If you've followed my coverage of Intel's
Terascale research project—especially the 80-core Polaris prototype—then
you know that this description fits what Intel has been working on for
the past few years and aggressively publicizing for a year or so.

But the basic tile + processor/cache + router + mesh network idea was
pioneered by Dr. Agarwal and MIT's Raw project about a decade ago, and
now those same ideas also form the basis for TILE64. TILE64 consists of
a mesh network of 64 tiles, with each tile containing a general-purpose
processor core and a non-blocking router. The short-pipeline, in-order,
three-issue cores implement a MIPS-derived VLIW ISA with a few important
and peculiar features.

Tilera's PR department is extremely focused on the mesh network and
larger SoC architectures as the initial selling points of the processor,
so information on the individual cores is hard to come by. Based on my
discussion with Tilera and the diagrams that the company provided (see
below), each core has a register file and three functional units: two
integer ALUs and a load-store unit. The cores also have a split L1 cache
(probably 16K), and a 64K chunk of L2 that has an interesting feature.
When there's a miss in one core's L2, the core checks the L2 caches of
the other cores for the needed data before propagating the miss out to
main memory. In this respect, the L2s collectively act like a large 4MB

As you can probably make out from the diagram above, TILE64 has four
DDR2 controllers, two 10-gigabit Ethernet interfaces, two gigabit
Ethernet interfaces, two four-lane PCIe interfaces, and a flexible I/O
interface that can be software-configured to handle a number of

TILE64 is fabbed on TSMC's trailing-edge 90nm process and runs at speeds
from 600MHz to 900MHz. The launch of a 90nm product at a time when the
processor market is moving from 65nm to 45nm was undoubtedly done in
order to keep costs down. Tilera won't be able to afford to migrate this
product to a smaller process node until they get enough volume to
justify the investment.

The initial entries in the TILE64 line are now shipping on PCIe
daughterboards for development and production purposes. The processor is
also available in lots of 10,000 for $435, and further entries to the
TILE family are planned to include different core counts.

Raw roots

It's unfortunate for Tilera that Intel has had such success in
publicizing the tile- and mesh-based ideas using Terascale as a branding
vehicle, because Agarwal and Co. really did get there earlier. But if
TILE64 is true to its Raw roots—and I have some indications that it is—
then there are a few more interesting things going on below the surface
that are worth looking at.

The basic idea behind Raw, a project that was started well before
Moore's Law stopped delivering huge clockspeed increases, was that as
the number of transistors on a chip increases, wire delay becomes an
architecturally significant factor in chip design. The Alpha 21164,
Pentium 4, and PowerPC 970 are all examples of the first wave of
commodity processors that had made major microarchitectural concessions
(i.e., dedicated pipeline stages and increased load-use latencies for
certain sequences of integer operations) for wire delay, but the effects
of wire delay were still hidden from programmers as far as possible.

Agarwal's idea was to expose wire delay to programmers via the ISA. The
Raw project, the name of which seems to be a recursive acronym for "Raw
Architecture Workstation," exposes wire delay to the programmer as hops
on an on-chip mesh network. It takes one cycle for data to move from one
tile to the next, with the result that a compiler can statically
schedule operations among multiple tiles' ALUs by taking into account
the exact number of cycles that it takes for a result to propagate
across the chip.
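
To make the one-cycle-per-hop idea concrete, here is a minimal sketch in
Python. It assumes the 64 tiles sit on an 8x8 grid and that data moves
along rows and columns only, so the hop count is a Manhattan distance.
The grid shape, the routing rule, and the names tile_coords, hop_cycles,
and earliest_use are illustrative assumptions, not details taken from
Tilera or the Raw project.

```python
GRID_WIDTH = 8  # assumption: 64 tiles arranged as an 8x8 grid


def tile_coords(tile_id, width=GRID_WIDTH):
    """Map a tile number (0..63) to (row, column) on the grid."""
    return divmod(tile_id, width)


def hop_cycles(src, dst, cycles_per_hop=1):
    """Cycles for a value to travel from tile src to tile dst,
    assuming one cycle per hop and row/column-only routing."""
    src_row, src_col = tile_coords(src)
    dst_row, dst_col = tile_coords(dst)
    hops = abs(src_row - dst_row) + abs(src_col - dst_col)
    return hops * cycles_per_hop


def earliest_use(produce_cycle, src, dst):
    """Earliest cycle at which dst can consume a result that src
    produces at produce_cycle (toy model of static scheduling)."""
    return produce_cycle + hop_cycles(src, dst)


if __name__ == "__main__":
    print(hop_cycles(0, 1))        # neighbouring tiles
    print(hop_cycles(0, 63))       # opposite corners of the grid
    print(earliest_use(10, 0, 9))  # result produced at cycle 10 on tile 0
```

Because the latency is a fixed function of tile positions in this model,
a compiler could in principle decide at build time which cycle a
dependent operation on another tile may start.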

TILE64 inherits this one-cycle-per-tile feature of Raw, so this type of
static scheduling is still possible if you want to write to the bare
metal. However, none of this is publicized in the Tilera press materials,
with ease of programming via Linux and an ANSI C toolchain being
emphasized instead.

Note that Raw also had a special bypass network that could take a result
from one tile's ALU and route it directly into another tile's ALU so
that the compiler could use this network to schedule dependent integer
ops. This is a neat trick, and the TILE line inherits this ability;
there are four special registers to which an ALU can write to have data
sent out directly over the network to another tile.

Wire delay isn't the only microarchitectural feature that Raw exposes to
the compiler/programmer. Indeed, Raw was a pretty thorough attempt to
repackage the RISC philosophy of "show everything to the compiler and
let compiler writers manage the complexity" for the
+100-million-transistor era. Even the pin-outs in the package were put
under software control as "ports." The result was that you could use Raw
to create a kind of "software ASIC" by compiling what Raw programmers
referred to as a "software circuit"—an application that's totally tuned
down to the cycle level to fit a specific Raw implementation.

I'm due to talk to the head of Tilera's software team, which is actually
larger than the company's hardware team, later today, so I can post an
update when I find out more. But my sense is that all of this software
complexity is still there in TILE64, but it's hidden from most
programmers by a very complex and carefully written toolchain that keeps
Raw's "network hops" and "ports" and memory hierarchy management as far
away from most programmers as possible.

Performance and market positioning

TILE64 is initially being pitched at the embedded market, with
wire-speed network processing and HD media encoding being the two main
application scenarios that Tilera wants to see it used in. Each TILE64
processor is capable of encoding two simultaneous streams of H.264
video, and over ten streams of broadcast-quality high definition video.

Tilera claims that TILE64 shows a 30x performance per watt advantage
over a 3GHz Xeon running a SNORT benchmark, with the new chip able to
run the benchmark at 10Gbps speeds. (Note that I've asked for more
details on this benchmark run, so I'll publish them when the company
gets back to me.) Also claimed for TILE64 is a 40x performance advantage
over a TI DM648 DSP chip on a 16x16 SAD (sum of absolute differences)
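
For readers unfamiliar with the kernel named above: a sum of absolute
differences compares two pixel blocks by adding up the per-pixel
differences, and video encoders commonly use it in motion estimation.
The sketch below is a plain-Python illustration of the arithmetic only.
The function name sad_16x16 and the sample blocks are made up for the
example, and it says nothing about how Tilera or TI ran their benchmark.

```python
def sad_16x16(block_a, block_b):
    """Sum of absolute differences between two 16x16 pixel blocks,
    each given as a list of 16 rows of 16 integers."""
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
    return total


if __name__ == "__main__":
    flat = [[100] * 16 for _ in range(16)]
    brighter = [[103] * 16 for _ in range(16)]
    print(sad_16x16(flat, flat))      # identical blocks give 0
    print(sad_16x16(flat, brighter))  # 256 pixels, each differing by 3
```

A lower result means the two blocks are more alike, which is why an
encoder would evaluate this kernel many times per frame.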

I have to confess that Tilera's choice of benchmark bake-off opponents
is a bit odd to me, but now that the product is shipping, I expect to
see it benched against, say, GPGPU products or IBM's Cell in media
encoding bake-offs and chips like Sun's UltraSPARC T2 in massively
multithreaded integer and floating-point workloads. After all, the US T2
can handle 64 simultaneous threads of execution just like the TILE64,
and it contains similar network and memory interface hardware, so the
two should be a good match-up. Tilera claims that TILE64 dissipates
between 170 and 300 milliwatts per tile, which compares very favorably
to the US T2's already low 2 watts per thread.
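
As a rough back-of-the-envelope check on those figures, the sketch below
multiplies the quoted per-tile and per-thread numbers by 64. Treating the
two figures as directly comparable, and ignoring whatever power the rest
of each chip draws, are assumptions made for the sake of the arithmetic;
the constant and function names are illustrative.

```python
TILES = 64                  # tiles on TILE64, per the article
TILE_WATTS_LOW = 0.170      # claimed lower bound per tile
TILE_WATTS_HIGH = 0.300     # claimed upper bound per tile
T2_THREADS = 64             # hardware threads on the UltraSPARC T2
T2_WATTS_PER_THREAD = 2.0   # figure quoted in the article


def total_watts(units, watts_per_unit):
    """Naive total: number of units times power per unit."""
    return units * watts_per_unit


if __name__ == "__main__":
    tile_low = total_watts(TILES, TILE_WATTS_LOW)
    tile_high = total_watts(TILES, TILE_WATTS_HIGH)
    t2_total = total_watts(T2_THREADS, T2_WATTS_PER_THREAD)
    print(f"TILE64 tiles: {tile_low:.2f} W to {tile_high:.2f} W")
    print(f"UltraSPARC T2 threads: {t2_total:.2f} W")
    print(f"Ratio at TILE64's upper bound: {t2_total / tile_high:.1f}x")
```

On these assumptions the tiles together draw roughly 11 to 19 watts
against 128 watts for the T2's 64 threads.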

Every many-core processor story that I or anyone else writes nowadays is
really a software story; efficiently and effectively programming
many-core chips is by no means a solved problem, and my money says that
the Raw "expose everything to software" approach makes the challenge
that much greater. So what will make or break Tilera is not how many
peak theoretical operations per second it's capable of (Tilera claims
192 billion 32-bit ops/sec), nor how energy-efficient its mesh network
is, but how easy it is for programmers to extract performance from the
device. That's the critical piece of TILE64's launch story that's
missing right now, and it's what I'll keep an eye out for as I watch
this product make its way in the market.
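
The peak figure quoted above can be sanity-checked with simple
arithmetic. The sketch below assumes that peak throughput is just cores
times issue width times clock rate, which is a common way to state such
numbers; the article does not say how Tilera arrived at 192 billion, so
the model and the function names are assumptions.

```python
CORES = 64        # tiles on the chip
ISSUE_WIDTH = 3   # three-issue cores, per the article


def peak_ops_per_sec(cores, issue_width, clock_hz):
    """Peak ops/sec if every issue slot on every core is filled."""
    return cores * issue_width * clock_hz


def implied_clock_hz(peak_ops, cores, issue_width):
    """Clock rate that a given peak figure implies under this model."""
    return peak_ops // (cores * issue_width)


if __name__ == "__main__":
    for clock_mhz in (600, 900):
        ops = peak_ops_per_sec(CORES, ISSUE_WIDTH, clock_mhz * 1_000_000)
        print(f"{clock_mhz} MHz: {ops / 1e9:.1f} billion ops/sec")
    clock = implied_clock_hz(192_000_000_000, CORES, ISSUE_WIDTH)
    print(f"192 billion ops/sec implies {clock / 1e6:.0f} MHz")
```

Under this model the quoted 600MHz to 900MHz range yields about 115 to
173 billion ops/sec, and 192 billion corresponds to a 1GHz clock.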

Though there are any number of questions about this product that remain
to be answered, one thing is for certain: TILE64 has indeed brought us
into the era of 64 general-purpose, mesh-networked processor cores on a
single chip, and that's a major milestone. So take a good look at
TILE64, because regardless of what happens to Tilera, this is probably
what the many-core era looks like.

Update: I originally speculated that the processor cores each had two
ALUs and an FPU, but that's incorrect. A Tilera rep has informed me that
each core has two ALUs. I've also briefly updated the relevant parts of
the article with some new, post-launch information about the processor,
and I look forward to posting a follow-up that goes into a little more
detail on the microarchitectural and software aspects of TILE.



----- End forwarded message -----
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
