Teraflops Research Chip | |
Clock: | 5.67 GHz |
Cores: | 80 |
Transistors: | 100,000,000 |
Socket: | custom 1248-pin LGA (343 signal pins) |
Designfirm: | Intel Tera-Scale Computing Research Program |
Data-Width: | 38-bit |
Arch: | 96-bit VLIW |
Produced-Start: | 2006 |
Successor: | Xeon Phi |
The Intel Teraflops Research Chip (codenamed Polaris) is a research manycore processor containing 80 cores and using a network-on-chip architecture, developed by Intel's Tera-Scale Computing Research Program.[1] It was manufactured using a 65 nm CMOS process with eight layers of copper interconnect and contains 100 million transistors on a 275 mm² die.[2] Its design goal was to demonstrate a modular architecture capable of a sustained performance of 1.0 TFLOPS while dissipating less than 100 W. Research from the project was later incorporated into Xeon Phi. The technical lead of the project was Sriram R. Vangal.
The processor was initially presented at the Intel Developer Forum on September 26, 2006[3] and officially announced on February 11, 2007.[4] A working chip was presented at the 2007 IEEE International Solid-State Circuits Conference, alongside technical specifications.
The chip consists of a 10×8 2D mesh network of cores and nominally operates at 4 GHz.[5] Each core, called a tile (3 mm²), contains a processing engine and a 5-port wormhole-switched router (0.34 mm²) with mesochronous interfaces; the router has a bandwidth of 80 GB/s and a latency of 1.25 ns at 4 GHz. The processing engine in each tile contains two independent, 9-stage pipelined single-precision floating-point multiply-accumulator (FPMAC) units, 3 KB of single-cycle instruction memory and 2 KB of data memory. Each FPMAC unit can perform 2 single-precision floating-point operations per cycle, so each tile has an estimated peak performance of 16 GFLOPS at the standard configuration of 4 GHz. A 96-bit very long instruction word (VLIW) encodes up to eight operations per cycle. The custom instruction set includes instructions to send and receive packets over the chip's network, as well as instructions for sleeping and waking a particular tile.[6]
Underneath each tile, a 256 KB SRAM module (codenamed Freya) was 3D-stacked, bringing memory nearer to the processor to increase overall memory bandwidth to 1 TB/s, at the expense of higher cost, thermal stress and latency, and a small total capacity of 20 MB.[7] The network of Polaris was shown to have a bisection bandwidth of 1.6 Tbit/s at 3.16 GHz and 2.92 Tbit/s at 5.67 GHz.[8]
Other prominent features of the Teraflops Research Chip include its fine-grained power management, with 21 independent sleep regions on a tile and dynamic tile sleep, and its very high energy efficiency: 27 GFLOPS/W theoretical peak at 0.6 V and 19.4 GFLOPS/W measured for the stencil application at 0.75 V.[9]
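Several of the figures quoted above follow from one another by simple arithmetic. The short Python sketch below re-derives the per-tile peak, the total stacked SRAM capacity and the router latency in clock cycles from the numbers in the text. The constant and variable names are illustrative only and are not taken from any Intel documentation.

```python
# Figures quoted in the article text; names are hypothetical, chosen for this sketch.
CLOCK_HZ = 4.0e9             # nominal clock frequency
FPMACS_PER_TILE = 2          # FPMAC units in each processing engine
FLOPS_PER_FPMAC_CYCLE = 2    # single-precision operations per FPMAC per cycle
TILES = 80
SRAM_PER_TILE_KB = 256       # stacked SRAM under each tile
ROUTER_LATENCY_S = 1.25e-9   # router latency quoted at 4 GHz

# Peak single-precision throughput of one tile at the nominal clock.
tile_peak_gflops = FPMACS_PER_TILE * FLOPS_PER_FPMAC_CYCLE * CLOCK_HZ / 1e9

# Total capacity of the 3D-stacked SRAM across all tiles.
stacked_sram_mb = TILES * SRAM_PER_TILE_KB / 1024

# Router latency expressed in clock cycles rather than nanoseconds.
router_latency_cycles = ROUTER_LATENCY_S * CLOCK_HZ

print(f"tile peak:      {tile_peak_gflops:.1f} GFLOPS")
print(f"stacked SRAM:   {stacked_sram_mb:.1f} MB")
print(f"router latency: {router_latency_cycles:.2f} cycles")
```

The results agree with the 16 GFLOPS per tile and 20 MB total capacity stated in the text. The quoted router latency works out to about five clock cycles at 4 GHz.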
Instruction type | Latency (cycles) |
FPMAC | 9 |
LOAD/STORE | 2 |
SEND/RECEIVE | 2 |
JUMP/BRANCH | 1 |
STALL/WFD | ? |
SLEEP/WAKE | 6 |
Application | FLOP count | Sustained performance (TFLOPS) | % of peak | Active tiles |
Stencil | 358K | 1.00 | 73.3% | 80 |
SGEMM: Matrix multiplication | 2.63M | 0.51 | 37.5% | 80 |
Spreadsheet | 64.2K | 0.45 | 33.2% | 80 |
2D FFT | 196K | 0.02 | 2.73% | 64 |
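The application table can be cross-checked by working backwards from each row. The sketch below assumes the third column is sustained TFLOPS and the fourth is the percentage of theoretical peak, and that each tile delivers 4 FLOPs per cycle (two FPMAC units at 2 operations each, as described in the text). That reading of the columns is an assumption, and the function names are hypothetical.

```python
# (application, sustained TFLOPS, fraction of peak, active tiles), copied from the table.
rows = [
    ("Stencil", 1.00, 0.733, 80),
    ("SGEMM", 0.51, 0.375, 80),
    ("Spreadsheet", 0.45, 0.332, 80),
    ("2D FFT", 0.02, 0.0273, 64),
]

FLOPS_PER_TILE_CYCLE = 4  # 2 FPMAC units x 2 FLOPs per cycle, per the text


def implied_peak_tflops(sustained, fraction):
    """Peak throughput implied by a sustained figure and its share of peak."""
    return sustained / fraction


def implied_clock_ghz(sustained, fraction, tiles):
    """Clock frequency that would produce the implied peak on `tiles` tiles."""
    peak_gflops = implied_peak_tflops(sustained, fraction) * 1e3
    return peak_gflops / (tiles * FLOPS_PER_TILE_CYCLE)


for name, sustained, fraction, tiles in rows:
    peak = implied_peak_tflops(sustained, fraction)
    clock = implied_clock_ghz(sustained, fraction, tiles)
    print(f"{name:12s} implied peak {peak:.2f} TFLOPS, implied clock {clock:.2f} GHz")
```

Under these assumptions the three 80-tile rows imply nearly the same clock, a little above 4.2 GHz, which suggests they were measured at a single operating point. The 2D FFT row is given to only one significant digit (0.02 TFLOPS), so its implied values are too coarse to compare.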
Supply voltage | Clock frequency | Peak performance | Power | Temperature |
0.60 V | 1.0 GHz | 0.32 TFLOPS | 11 W | 110 °C |
0.675 V | 1.0 GHz | 0.32 TFLOPS | 15.6 W | 80 °C |
0.70 V | 1.5 GHz | 0.48 TFLOPS | 25 W | 110 °C |
0.70 V | 1.35 GHz | 0.43 TFLOPS | 18 W | 80 °C |
0.75 V | 1.6 GHz | 0.51 TFLOPS | 21 W | 80 °C |
0.80 V | 2.1 GHz | 0.67 TFLOPS | 42 W | 110 °C |
0.80 V | 2.0 GHz | 0.64 TFLOPS | 26 W | 80 °C |
0.85 V | 2.4 GHz | 0.77 TFLOPS | 32 W | 80 °C |
0.90 V | 2.6 GHz | 0.83 TFLOPS | 70 W | 110 °C |
0.90 V | 2.85 GHz | 0.91 TFLOPS | 45 W | 80 °C |
0.95 V | 3.16 GHz | 1.0 TFLOPS | 62 W | 80 °C |
1.00 V | 3.13 GHz | 1.0 TFLOPS | 98 W | 110 °C |
1.00 V | 3.8 GHz | 1.22 TFLOPS | 78 W | 80 °C |
1.05 V | 4.2 GHz | 1.34 TFLOPS | 82 W | 80 °C |
1.10 V | 3.5 GHz | 1.12 TFLOPS | 135 W | 110 °C |
1.10 V | 4.5 GHz | 1.44 TFLOPS | 105 W | 80 °C |
1.15 V | 4.8 GHz | 1.54 TFLOPS | 128 W | 80 °C |
1.20 V | 4.0 GHz | 1.28 TFLOPS | 181 W | 110 °C |
1.20 V | 5.1 GHz | 1.63 TFLOPS | 152 W | 80 °C |
1.25 V | 5.3 GHz | 1.70 TFLOPS | 165 W | 80 °C |
1.30 V | 4.4 GHz | 1.39 TFLOPS | ? | 110 °C |
1.30 V | 5.5 GHz | 1.76 TFLOPS | 210 W | 80 °C |
1.35 V | 5.67 GHz | 1.81 TFLOPS | 230 W | 80 °C |
1.40 V | 4.8 GHz | 1.52 TFLOPS | ? | 110 °C |
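The voltage and frequency table also shows how energy efficiency falls as the chip is pushed to higher clocks. The sketch below divides the listed performance by the listed power for a handful of the 80 °C rows. The row selection and the helper name `gflops_per_watt` are choices made for this illustration, and the results are peak-based figures rather than measured application efficiency.

```python
# (supply voltage V, clock GHz, listed TFLOPS, power W) for selected 80 °C rows.
rows_80c = [
    (0.75, 1.6, 0.51, 21),
    (0.95, 3.16, 1.0, 62),
    (1.20, 5.1, 1.63, 152),
    (1.35, 5.67, 1.81, 230),
]


def gflops_per_watt(tflops, watts):
    """Energy efficiency from a throughput in TFLOPS and a power in watts."""
    return tflops * 1000 / watts


efficiencies = [gflops_per_watt(t, p) for _, _, t, p in rows_80c]

for (volts, ghz, _, _), eff in zip(rows_80c, efficiencies):
    print(f"{volts:.2f} V  {ghz:.2f} GHz  {eff:5.1f} GFLOPS/W")
```

For the selected rows, efficiency drops steadily as voltage and frequency rise: the roughly 3.5-fold increase in clock between the first and last row costs about an 11-fold increase in power.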
Intel aimed to ease software development for the exotic new architecture by creating a programming model specifically for the chip, called Ct. The model never gained the following Intel had hoped for and was eventually incorporated into Intel Array Building Blocks, a now-defunct C++ library.
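$\mathrm{FLOPS}_{\text{peak}} = f_{\max} \cdot 80\ \text{tiles} \cdot 2\ \tfrac{\text{FPMAC}}{\text{tile}} \cdot 2\ \tfrac{\text{FLOP}}{\text{FPMAC} \cdot \text{cycle}}$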
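The peak-performance relation can be checked against the table with a few lines of Python. The sketch assumes the formula multiplies the clock frequency by 80 tiles, 2 FPMAC units per tile and 2 FLOPs per FPMAC per cycle, as the article's description of the tile suggests. The function name `peak_tflops` is hypothetical.

```python
TILES = 80
FPMACS_PER_TILE = 2
FLOPS_PER_FPMAC_CYCLE = 2


def peak_tflops(f_ghz, tiles=TILES):
    """Theoretical peak in TFLOPS for a clock in GHz (assumed formula)."""
    return f_ghz * tiles * FPMACS_PER_TILE * FLOPS_PER_FPMAC_CYCLE / 1000


# Clock (GHz) -> peak performance (TFLOPS) as listed in the table.
listed = {1.0: 0.32, 3.16: 1.0, 4.0: 1.28, 5.67: 1.81}

for f_ghz, table_value in listed.items():
    computed = peak_tflops(f_ghz)
    # The table rounds to two or three significant figures, so allow a small gap.
    assert abs(computed - table_value) < 0.02
    print(f"{f_ghz:.2f} GHz: computed {computed:.3f} TFLOPS, listed {table_value} TFLOPS")
```

Passing a smaller `tiles` value gives the peak for a partially active chip, for example the 64 tiles used by the 2D FFT benchmark.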