Spatial Computation

Spatial Computation

Spatial Computation Mihai Budiu CMU CS Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze SCS Ph.D. Thesis defense, December 8, 2003 Spatial Computation Thesis committee: Seth Goldstein A model of general-purpose computation Peter Lee SCS based on Application-Specific Hardware. Todd Mowry

Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 2 Thesis Statement Application-Specific Hardware (ASH): can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 3 Outline Introduction Compiling for ASH

Media processing on ASH ASH vs. superscalar processors Conclusions 4 CPU Problems Complexity Power Global Signals Limited ILP 5 Design Complexity Design Time: CAD productivity favors FPL 100,000,000 10,000,000

Logic transistors/chip .1 0 1,000,000 1,000,000 10,000 100,000 10,000 1000 100 10 x x x x x 1000 x

100 21%/Year 2009 2005 2007 2003 2001 1999 1997 1993 1995 1991 1989

1987 1985 10 1983 2 .5 x 1981 Chip size (K transistors) 58%/Year Productivity 100,000 .3 5 10,000,000

Transistors/staff*month Source: S. Malik, orig Sematech from Michael Flynns FCRC 2003 talk 6 Communication vs. Computation gate wire 5ps 20ps Power consumption on wires is also dominant 7 Our Approach: ASH Application-Specific Hardware 8

Resource Binding Time 1. 1. 2. Programs 2. CPU Programs ASH 9 Hardware Interface software software ISA virtual ISA

hardware CPU hardware gates ASH 10 Application-Specific Hardware C program Dataflow IR Compiler Reconfigurable/custom hw 11 Contributions Embedded systems Asynchronous

circuits Computer architecture Reconfigurable computing Compilation High-level synthesis Nanotechnology Dataflow machines s y s m te s theory 12

Outline Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions 13 Computation = Dataflow Programs Circuits a x = a & 7; ... y = x >> 2; 7 & x 2

>> Operations ) functional units Variables ) wires No interpretation 14 Basic Operation + latch data ack valid 15 Asynchronous Computation + data 1

latch 5 + + + ack valid 2 + 3 + 6 4

+ 7 + 8 16 Distributed Control Logic FSM ack rdy + short, local wires asynchronous control 17 Forward Branches b if (x > 0) y = -x; else

y = b*x; x * 0 - > ! y Conditionals ) Speculation critical path 18 Control Flow ) Data Flow data Merge (label) data data

predicate Gateway Split (branch) p ! 19 0 Loops i int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; * 0

+1 < 100 sum + ! ret 20 Predication and Side-Effects addr token sequencing of side-effects no speculation pred Load data

to memory token 21 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 22 Outline Introduction CASH: Compiling for ASH An optimization on the SIDE Media processing on ASH ASH vs. superscalar processors

Conclusions skip to 23 Availability Dataflow Analysis y = a*b; ... if (x) { ... y ... = a*b; } 24 Dataflow Analysis Is Conservative if (x) { ... y = a*b; } ... y ... = a*b;

? 25 Static Instantiation, Dynamic Evaluation flag = false; if (x) { ... y = a*b; flag = true; } ... ... = flag ? y : a*b; 26 0 epic_d 10 5 300.twolf

300.twolf 254.gap 197.parser 181.mcf 176.gcc 35 175.vpr 164.gzip 188.ammp 183.equake 147.vortex 134.perl

132.ijpeg 130.li 129.compress 124.m88ksim 099.go mesa rasta pgp_d pgp_e g721_d g721_e pegwit_d pegwit_e

jpeg_d jpeg_e mpeg2_d 197.parser 254.gap epic_d mpeg2_e 176.gcc epic_e 181.mcf 175.vpr gsm_d gsm_e

% reduction 40 164.gzip 188.ammp 183.equake adpcm_d 15 adpcm_e %st promo %st PRE 134.perl 0 147.vortex 132.ijpeg

Stores 130.li 129.compress 30 124.m88ksim 099.go mesa rasta pgp_d pgp_e g721_d g721_e pegwit_d

pegwit_e jpeg_d jpeg_e mpeg2_d mpeg2_e 20 epic_e 25 gsm_d gsm_e adpcm_d adpcm_e

SIDE Register Promotion Impact 45 Loads % ld promo % ld PRE 30 25 20 15 53 10 5 27 Outline Introduction

CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions 28 Performance Evaluation ASH LSQ L1 8K L2 1/4M Mem limited BW CPU: 4-way OOO Assumption: all operations have the same latency.

29 5.8 rasta pegwit_e pegwit_d mpeg2_e mpeg2_d mesa 5.8 jpeg_e jpeg_d 3 gsm_e

gsm_d g721_e g721_d epic_e epic_d adpcm_e adpcm_d Times faster Media Kernels, vs 4-way OOO 12 2.5 2 1.5

1 0.5 0 30 rasta pegwit_e pegwit_d mpeg2_e mpeg2_d mesa jpeg_e jpeg_d

gsm_e gsm_d g721_e g721_d epic_e epic_d adpcm_e adpcm_d Media Kernels, IPC 25 Base IPC ASH IPC 20 15

10 5 4 0 31 10 Speed-up rasta pegwit_e pegwit_d mpeg2_e mpeg2_d mesa

jpeg_e jpeg_d gsm_e gsm_d g721_e g721_d epic_e 8 epic_d 9 adpcm_e adpcm_d

Times bigger Speed-up IPC Correlation 12 IPC Ratio 7 6 5 4 3 2 1 0 32

Low-Level Evaluation C CASH core Results shown so far. All results in thesis. Verilog back-end Synopsys, Cadence P/R 180nm std. cell library, 2V ~1999 technology Results in the next two slides. ASIC 33 pe gw

_e it_ e it_ d pe g2 2_ d pe gw m _e _d eg

_d m pe g jp gs m gs m g7 21 _e _d _d g7 21

ad pc m Square mm 12 Area 10 8 6 4 2 0 Reference: P4 in 180nm has 217mm2 34 Power

350 Times smaller than OOO 300 250 200 150 100 50 0 power ratio adpcm_d g721_d g721_e gsm_d gsm_e jpeg_d

mpeg2_d mpeg2_e pegwit_d pegwit_e 70 41 41 129 147 94 121 136

303 303 vs 4-way OOO superscalar, 600 Mhz, with clock gating (Wattch), ~ 6W 35 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 36 Outline Introduction CASH: Compiling for ASH Media processing on ASH dataflow pipelining ASH vs. superscalar processors Conclusions

skip to 37 i Pipelining * + <= pipelined multiplier (8 stages) int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; 100

1 sum + cycle=1 38 i Pipelining * 100 1 + <= sum

+ cycle=2 39 i Pipelining * 100 1 + <= sum + cycle=3 40

i Pipelining * 100 1 + <= sum + cycle=4 41 i Pipelining

i=1 100 1 + <= i=0 sum + pipeline balancing cycle=5 42 Outline

Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions 43 This Is Obvious! ASH runs at full dataflow speed, so CPU cannot do any better (if compilers equally good). 44 Percent slower / faster -10 0 147.vortex 134.perl 132.ijpeg

130.li 129.compress 124.m88ksim -20 099.go SpecInt95, ASH vs 4-way OOO 30 20 10 -30 -40 -50 45

Branch Prediction i ASH crit path for (i=0; i < N; i++) { ... CPU crit path 1 + < exception if (exception) break; } Predicted not taken Effectively a noop for CPU! Predicted taken. !

& result available before inputs 46 Percent slower/faster -20 60 147.vortex 134.perl 132.ijpeg 130.li 129.compress 124.m88ksim -40 099.go

SpecInt95, perfect prediction baseline prediction 40 20 no data 0 -60 47 ASH Problems Both branch and join not free Static dataflow (no re-issue of same instr) Memory is far Fully static No branch prediction No dynamic unrolling No register renaming Calls/returns not lenient ...

48 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 49 Outline Introduction + CASH: Compiling for ASH + Media processing on ASH + ASH vs. superscalar processors = Conclusions 50 Strengths low power simple verification?

economies of scale highly optimized specialized to app. unlimited ILP simple hardware no fixed window branch prediction control speculation full-dataflow global signals/decision 51

Conclusions Compiling around the ISA is a fruitful research approach. Distributed computation structures require more synchronization overhead. Spatial Computation efficiently implements high-ILP computation with very low power. 52 Backup Slides Control logic Pipeline balancing Lenient execution Dynamic Critical Path Memory PRE Critical path analysis CPU + ASH 53 Control Logic rdyin C C

ackin ackout rdyout datain Reg dataout back back to talk 54 Last-Arrival Events + data

ack valid Event enabling the generation of a result May be an ack Critical path=collection of last-arrival edges 55 Dynamic Critical Path 3. Some edges may repeat 2. Trace back along last-arrival edges 1. Start from last node back back to analysis 56 Critical Paths b if (x > 0) y = -x;

else y = b*x; x * 0 - > ! y 57 Lenient Operations b if (x > 0) y = -x; else y = b*x; x *

0 - > ! y Solve the problem of unbalanced paths back back to talk 58 i Pipelining * i=1 100

1 + <= i=0 sum + cycle=6 59 i Pipelining * + <= is loop

predicate 100 1 Long latency pipe sum sums loop + cycle=7 60 i Pipelining is loop *

critical path 100 1 + <= Predicate ack edge is on the critical path. sum sums loop + 61 Pipelinine balancing * i

100 1 + <= is loop decoupling FIFO sum sums loop + cycle=7 62 Pipelinine balancing *

is loop i 100 1 + <= critical path decoupling FIFO sum sums loop + back back to presentation

63 Register Promotion (p1) *p= (p1) *p= (p2) =*p (p2 : p1) =*p Load is executed only if store is not 64 Register Promotion (2) (p1) *p= (p2) =*p (p1) *p= (false)

=*p When p2 ) p1 the load becomes dead... ...i.e., when store dominates load in CFG back 65 PRE (p1) ...=*p (p2) ...=*p (p1 p2) ...=*p This corresponds in the CFG to lifting the load to a basic block dominating the original loads 66 Store-store (1) (p1) *p=

(p1 : p2) *p= (p2) *p=... (p2) *p=... When p1 ) p2 the first store becomes dead... ...i.e., when second store post-dominates first in CFG 67 Store-store (2) (p1) *p= (p1 : p2) *p= (p2) *p=... (p2) *p=... Token edge eliminated, but... ...transitive closure of tokens preserved back 68

A Code Fragment for(i = 0; i < 64; i++) { for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; Y[i] = X[j].q; } SpecINT95:124.m88ksim:init_processor, stylized 69 Dynamic Critical Path definition sizeof(X[j]) load predicate loop predicate for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; 70

MIPS gcc Code LOOP: L1: beq $v0,$a1,EXIT L2: addiu $v1,$v1,20 L3: lw $v0,0($v1) L4: addiu $a0,$a0,1 L5: bne $v0,$a3,LOOP EXIT: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) ; X[j].r == i ; &X[j+1].r ; X[j+1].r ; j++ ; X[j+1].r == 0xF L1! L2 ! L3 ! L5 ! L1 4-instructions loop-carried dependence break; 71 If Branch Prediction Correct

LOOP: L1: beq $v0,$a1,EXIT L2: addiu $v1,$v1,20 L3: lw $v0,0($v1) L4: addiu $a0,$a0,1 L5: bne $v0,$a3,LOOP EXIT: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; ; X[j].r == i ; &X[j+1].r ; X[j+1].r ; j++ ; X[j+1].r == 0xF L1! L2 ! L3 ! L5 ! L1 Superscalar is issue-limited! 2 cycles/iteration sustained 72 Critical Path with Prediction

Loads are not speculative for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; 73 Prediction + Load Speculation ~4 cycles! Load not pipelined ack edge (self-anti-dependence) for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; 74 OOO Pipe Snapshot LOOP: L1: beq $v0,$a1,EXIT L2: addiu $v1,$v1,20 L3: lw $v0,0($v1)

L4: addiu $a0,$a0,1 L5: bne $v0,$a3,LOOP EXIT: register renaming ; X[j].r == i ; &X[j+1].r ; X[j+1].r ; j++ ; X[j+1].r == 0xF IF DA EX WB CT L5 L1

L2 L1 L2 L3 L4 L1 L3 L5 L3 L2 L1 L3 L3 75 Unrolling? for(i = 0; i < 64; i++) { for (j = 0; X[j].r != 0xF; j+=2) { if (X[j].r == i) break;

if (X[j+1].r == 0xF) break; when 1 if (X[j+1].r == i) break; } Y[i] = X[j].q; } back back to talk iteration 76 Ideal Architecture Low ILP computation + OS + VM CPU ASH

High-ILP computation Memory back 77

Recently Viewed Presentations

  • Biology Honors Pre AP 3rd and 4th Nine

    Biology Honors Pre AP 3rd and 4th Nine

    Must be colorful using more than 4 colors (black white and gray do not count) ... but the wolf population was driven to. extinction by humans. After many years wolves were reintroduced to the park, how do ... 1. describe...
  • Persia, the Greeks and Alexander

    Persia, the Greeks and Alexander

    Creates the League of Corinth. Elected to lead the Greek army against Persia. Assassinated. Sarissa - 13-20 feet long. Alexander III of Macedon "The Great"July 356 - June 323 BCE. 337 BCE - Became king on his father's death. Bloody...
  • Automating Fortran - C Interoperability

    Automating Fortran - C Interoperability

    The AST stands for abstract syntax tree. It means a tree representation of the abstract syntactic structure of source code written in a programming language. And it's widely used in compilers as intermediate representation. Because the tree structure only preserves...
  • Phys 102 - Lecture 2

    Phys 102 - Lecture 2

    Phys. 102, Lecture 22, Slide . Need two (or more) waves. Must have same wavelength. Must be . coherent. Use one light source with waves taking different paths: Two slits. Two different refractive indices. Reflection off of two different surfaces...
  • Place Value Relationships

    Place Value Relationships

    In the number 3,554, does having a 3 in the thousands place and a 4 in the ones place change the relationship between the values of the two 5's? No, the 5 in the hundreds place is still 10 times...
  • Instructions for the Analogy Essay

    Instructions for the Analogy Essay

    Some analogies are worth a million dollars! John Searle, a modern philosopher, was asked, "Will a computer ever think like a human being?" His answer, the Chinese Room Analogy, is worth over a $1,000,000!
  • Upper Air Soundings How to understand and use

    Upper Air Soundings How to understand and use

    The convective condensation level (CCL) is the height to which a parcel of air, if heated sufficiently from below, will rise adiabatically until it is just saturated. Usually, it is the height of the base of cumuliform clouds produced by...
  • The Story of Ancient Greece

    The Story of Ancient Greece

    Geography of Greece. Greece is a small country in Europe. Greece is near the Mediterranean Sea. The main part of Greece in on a peninsula. A peninsula is a body of land surrounded by water on three sides.