EECS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering
CS252 Graduate Computer Architecture
Lecture 3 - Pipelining

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Last Time in Lecture 2

Microcoding, an effective technique to manage control unit complexity, was invented in an era when logic (tubes), main memory (magnetic core), and ROM (diodes) used different technologies
The difference between ROM and RAM speed motivated additional complex instructions
Technology advances leading to fast SRAM made the technology assumptions invalid
Complex instruction sets impede parallel and pipelined implementations
Load/store, register-rich ISAs (pioneered by Cray, popularized by RISC) perform better in new VLSI technology

Analyzing Microcoded Machines

John Cocke and group at IBM
  Working on a simple pipelined processor, 801, and advanced compilers inside IBM
  Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
  Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS, whereas 2 MIPS was considered good before)
Emer, Clark, at DEC
  Measured VAX-11/780 using external hardware
  Found it was actually a 0.5 MIPS machine, although usually assumed to be a 1 MIPS machine
  Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!
VAX8800
  Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
  4.5x more microstore RAM than cache RAM!

Iron Law of Processor Performance

\[ \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}} \]

Instructions per program depends on source code, compiler technology, and ISA
Cycles per instruction (CPI) depends on ISA and microarchitecture
Time per cycle depends upon the microarchitecture and base technology

CPI for Microcoded Machine

Inst 1: 7 cycles
Inst 2: 5 cycles
Inst 3: 10 cycles

Total clock cycles = 7+5+10 = 22
Total instructions = 3
CPI = 22/3 = 7.33
CPI is always an average over a large number of instructions.
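A minimal sketch of the Iron Law arithmetic for the three-instruction example above. Only the cycle counts (7, 5, 10) come from the slide; the function names and the 10 ns cycle time are assumptions chosen for illustration.

```python
def cpi(cycles_per_instruction):
    """Average cycles per instruction over an executed instruction trace."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)


def execution_time(instruction_count, cpi_value, cycle_time_s):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instruction_count * cpi_value * cycle_time_s


trace = [7, 5, 10]  # cycles taken by Inst 1, Inst 2, Inst 3 on the slide
average_cpi = cpi(trace)

# Hypothetical 10 ns cycle (100 MHz clock), not a figure from the lecture.
total_time_s = execution_time(len(trace), average_cpi, 10e-9)

print(f"CPI = {average_cpi:.2f}")
print(f"time = {total_time_s * 1e9:.0f} ns")
```

Because CPI is an average, the same function applies unchanged to a trace of millions of instructions; a three-instruction trace is only useful as a worked example.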

IC Technology Changes Tradeoffs

Logic, RAM, ROM all implemented using MOS transistors
Semiconductor RAM ~ same speed as ROM

Reconsidering Microcode Machine (Nanocoded 68000 example)

Exploits recurring control signal patterns in microcode, e.g.,
  ALU0   A ← Reg[rs1]
  ...
  ALUI0  A ← Reg[rs1]
  ...
[Figure: block diagram with µPC (state), next-state, µaddress, µcode ROM, nanoaddress, nanoinstruction ROM, and data; overlaid vertical labels read "User PC", "Inst. Cache", "Hardwired Decode", and "RISC!".]
Motorola 68000 had 17-bit microcode containing either a 10-bit jump or a 9-bit nanoinstruction pointer
Nanoinstructions were 68 bits wide, decoded to give 196 control signals

From CISC to RISC

Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  Contents of fast instruction memory change to fit application needs
Use simple ISA to enable hardwired pipelined implementation
  Most compiled code only used few CISC instructions
  Simpler encoding allowed pipelined implementations

Further benefit with integration
  In early 80s, finally fit 32-bit datapath + small caches on single chip
  No chip crossings in common case allows faster operation

Berkeley RISC Chips

RISC-I (1982) contains 44,420 transistors, was fabbed in 5 µm NMOS, has a die area of 77 mm², and ran at 1 MHz. This chip is probably the first VLSI RISC.
RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and has a die area of 60 mm².
Stanford built some too

Microprogramming is far from extinct

Played a crucial role in micros of the Eighties
  DEC uVAX, Motorola 68K series, Intel 286/386
Plays an assisting role in most modern micros
  e.g., AMD Zen, Intel Skylake, Intel Atom, IBM PowerPC
  Most instructions executed directly, i.e., with hard-wired control
  Infrequently-used and/or complicated instructions invoke microcode
Patchable microcode common for post-fabrication bug fixes, e.g., Intel processors load microcode patches at bootup
  Intel had to scramble to resurrect microcode tools and find original microcode engineers to patch Meltdown/Spectre security vulnerabilities

Iron Law of Processor Performance

\[ \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}} \]

Instructions per program depends on source code, compiler technology, and ISA
Cycles per instruction (CPI) depends on ISA and microarchitecture
Time per cycle depends upon the microarchitecture and base technology

Microarchitecture            CPI    Cycle time
Microcoded                   >1     short
Single-cycle unpipelined     1      long
Pipelined                    1      short

Classic 5-Stage RISC Pipeline

[Figure: five-stage datapath (Fetch, Decode, EXecute, Memory, Writeback) showing PC, Instruction Cache, Inst. Register, Registers, Imm, A and B operand registers, ALU, Store, and Data Cache.]
This version designed for regfiles/memories with synchronous reads and writes.

CPI Examples

Microcoded machine: Inst 1 (7 cycles), Inst 2 (5 cycles), Inst 3 (10 cycles)
  3 instructions, 22 cycles, CPI = 7.33
Unpipelined machine: Inst 1, Inst 2, Inst 3
  3 instructions, 3 cycles, CPI = 1
Pipelined machine: Inst 1, Inst 2, Inst 3
  3 instructions, 3 cycles, CPI = 1
5-stage pipeline CPI ≠ 5!!!

Instructions interact with each other in pipeline

An instruction in the pipeline may need a resource being used by another instruction in the pipeline: structural hazard
An instruction may depend on something produced by an earlier instruction
  Dependence may be for a data value: data hazard
  Dependence may be for the next instruction's address: control hazard (branches, exceptions)
Handling hazards generally introduces bubbles into pipeline and raises CPI above the ideal of 1

Pipeline CPI Examples

Measure from when first instruction finishes to when last instruction in sequence finishes.

Inst 1, Inst 2, Inst 3: 3 instructions finish in 3 cycles, CPI = 3/3 = 1
Inst 1, Inst 2, Bubble, Inst 3: 3 instructions finish in 4 cycles, CPI = 4/3 = 1.33
Inst 1, Bubble 1, Inst 2, Bubble 2, Inst 3: 3 instructions finish in 5 cycles, CPI = 5/3 = 1.67

Resolving Structural Hazards

Structural hazard occurs when two instructions need same hardware resource at same time
  Can resolve in hardware by stalling newer instruction till older instruction finished with resource
A structural hazard can always be avoided by adding more hardware to design
  E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory
Classic RISC 5-stage integer pipeline has no structural hazards by design
  Many RISC implementations have structural hazards on multicycle units such as multipliers, dividers, floating-point units, etc., and may also have them on register writeback ports

Types of Data Hazards

Consider executing a sequence of register-register instructions of type:
  rk ← ri op rj

Data-dependence: Read-after-Write (RAW) hazard
  r3 ← r1 op r2
  r5 ← r3 op r4
Anti-dependence: Write-after-Read (WAR) hazard
  r3 ← r1 op r2
  r1 ← r4 op r5
Output-dependence: Write-after-Write (WAW) hazard
  r3 ← r1 op r2
  r3 ← r6 op r7
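The three dependence types can be checked mechanically for a pair of register-register instructions. The sketch below is illustrative only: `Instr` and `hazards` are hypothetical names, and the code compares two instructions in program order rather than modeling a pipeline.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    dest: str  # register written (rk)
    src1: str  # first source register (ri)
    src2: str  # second source register (rj)


def hazards(older: Instr, newer: Instr) -> Set[str]:
    """Return the dependence types that `newer` has on `older`."""
    found = set()
    if older.dest in (newer.src1, newer.src2):
        found.add("RAW")  # newer reads the register older writes
    if newer.dest in (older.src1, older.src2):
        found.add("WAR")  # newer overwrites a register older reads
    if newer.dest == older.dest:
        found.add("WAW")  # both write the same register
    return found


first = Instr("r3", "r1", "r2")               # r3 <- r1 op r2
print(hazards(first, Instr("r5", "r3", "r4")))  # data-dependence example
print(hazards(first, Instr("r1", "r4", "r5")))  # anti-dependence example
print(hazards(first, Instr("r3", "r6", "r7")))  # output-dependence example
```

A pair of instructions can carry more than one dependence at once, which is why the function returns a set rather than a single label.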

Three Strategies for Data Hazards

Interlock
  Wait for hazard to clear by holding dependent instruction in issue stage
Bypass
  Resolve hazard earlier by bypassing value as soon as available
Speculate
  Guess on value, correct if wrong

Interlocking Versus Bypassing

add x1, x3, x5
sub x2, x1, x4

Instruction interlocked in decode stage:
  add x1, x3, x5   F D X M W
  bubble             F D X M W
  bubble               F D X M W
  bubble                 F D X M W
  sub x2, x1, x4           F D X M W

Bypass around ALU with no bubbles:
  add x1, x3, x5   F D X M W
  sub x2, x1, x4     F D X M W

Example Bypass Path

[Figure: five-stage datapath (Fetch, Decode, EXecute, Memory, Writeback) with one bypass path drawn.]
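The bubble counts for interlocking and bypassing can be reproduced with a small timing model. This is a sketch under stated assumptions, not the lecture's own formulation: the function names are hypothetical, the producer is taken to be fetched in cycle 0, and by default a register written in W is assumed readable only in the following cycle, which matches the three bubbles drawn for the interlocked case. Setting `same_cycle_write_read=True` models a register file that can be read in D during the same cycle as the write in W.

```python
# Stage indices for the classic five-stage pipeline: F=0, D=1, X=2, M=3, W=4.
D, X, M, W = 1, 2, 3, 4


def interlock_stalls(distance, same_cycle_write_read=False):
    """Stall cycles for a RAW-dependent instruction with no bypassing.

    distance: how many instructions after the producer the consumer
    issues (1 means the very next instruction). The producer is fetched
    in cycle 0, so it occupies stage s in cycle s.
    """
    # Earliest cycle in which the consumer may read the register file in D.
    earliest_read = W if same_cycle_write_read else W + 1
    unstalled_read = distance + D  # consumer's D cycle if it never stalls
    return max(0, earliest_read - unstalled_read)


def bypass_stalls(distance, producer_is_load=False):
    """Stall cycles with full bypassing into the X stage."""
    # ALU results exist at the end of X; load data exists at the end of M.
    ready_after = M if producer_is_load else X
    unstalled_execute = distance + X  # consumer's X cycle if it never stalls
    return max(0, ready_after + 1 - unstalled_execute)


print(interlock_stalls(1))                              # add followed by dependent sub
print(interlock_stalls(1, same_cycle_write_read=True))
print(bypass_stalls(1))                                 # ALU result bypassed
print(bypass_stalls(1, producer_is_load=True))          # load-use case
```

The load-use case still needs one stall even with full bypassing in this model, because the loaded value does not exist until the end of the M stage.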

Fully Bypassed Data Path

[Figure: five-stage datapath (Fetch, Decode, EXecute, Memory, Writeback) with bypass paths, shown with four overlapped instructions.]
  F D X M W
    F D X M W
      F D X M W
        F D X M W
[ Assumes data written to registers in a W cycle is readable in parallel D cycle (dotted line). Extra write data register and bypass paths required if this is not possible. ]

Value Speculation for RAW Data Hazards

Rather than wait for value, can guess value!
So far, only effective in certain limited cases:
  Branch prediction
  Stack pointer updates
  Memory address disambiguation

CS152 Administrivia

PS 1 is posted
PS 1 is due at start of class on Monday Feb 11
Lab 1 out on Friday
Lab 1 overview in Section Friday,
  1-2pm DIS 101 3113 Etcheverry

  2-3pm DIS 102 3107 Etcheverry

CS252 Administrivia

CS252 discussions grading policy
  We'll ignore your two lowest scores in grading, which includes absences
  Send in summary even if you can't attend discussion
CS252 Piazza class has been created
  Sign up for this as well as CS152 Piazza
  Each CS252 paper has dedicated thread
  Post your response as private note to instructors
  Due 6AM Monday before Monday discussion section

Control Hazards

What do we need to calculate next PC?
For Unconditional Jumps
  Opcode, PC, and offset
For Jump Register
  Opcode, Register value, and offset
For Conditional Branches
  Opcode, Register (for condition), PC and offset
For all other instructions
  Opcode and PC (and have to know it's not one of above)

Control flow information in pipeline

[Figure: five-stage datapath (Fetch, Decode, EXecute, Memory, Writeback) annotated with where control-flow information becomes available.]
  Fetch: PC known
  Decode: Opcode, offset known
  EXecute: Branch condition, Jump register value known

RISC-V Unconditional PC-Relative Jumps

[Figure: Fetch, Decode, and EXecute stages with jump logic; labels include Jump?, PC_fetch, PC_decode, +4, Add, Imm, Kill, FKill, and PCJumpSel.]
[ Kill bit turns instruction into a bubble ]
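The inputs listed under Control Hazards can be collected into a single next-PC function. This is a sketch, not the datapath from the figure: `Kind` and `next_pc` are hypothetical names, instructions are assumed to be a fixed 4 bytes, and details of real ISAs (such as RISC-V clearing the low bit of a jump-register target) are ignored.

```python
from enum import Enum, auto


class Kind(Enum):
    JUMP = auto()           # unconditional PC-relative jump
    JUMP_REGISTER = auto()  # jump to register value plus offset
    BRANCH = auto()         # conditional PC-relative branch
    OTHER = auto()          # every other instruction


def next_pc(kind, pc, offset=0, reg_value=0, taken=False, inst_bytes=4):
    """Return the address of the next instruction to execute."""
    if kind is Kind.JUMP:
        return pc + offset             # needs opcode, PC, offset
    if kind is Kind.JUMP_REGISTER:
        return reg_value + offset      # needs opcode, register value, offset
    if kind is Kind.BRANCH:
        # needs opcode, condition (from registers), PC, offset
        return pc + offset if taken else pc + inst_bytes
    return pc + inst_bytes             # needs opcode and PC only


print(hex(next_pc(Kind.OTHER, 0x100)))
print(hex(next_pc(Kind.JUMP, 0x100, offset=0x40)))
print(hex(next_pc(Kind.BRANCH, 0x100, offset=0x40, taken=False)))
print(hex(next_pc(Kind.JUMP_REGISTER, 0x100, offset=8, reg_value=0x2000)))
```

Each branch of the function uses a value that becomes known in a different pipeline stage, which is why jumps and branches resolved after Fetch force the instructions fetched behind them to be killed.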

Pipelining for Unconditional PC-Relative Jumps

  j target                  F D X M W
  bubble                      F D X M W
  target: add x1, x2, x3        F D X M W

Branch Delay Slots

Early RISCs adopted idea from pipelined microcode engines, and changed ISA semantics so instruction after branch/jump is always executed before control flow change occurs:
  0x100 j target
  0x104 add x1, x2, x3 // Executed before target
  0x205 target: xori x1, x1, 7
Software has to fill delay slot with useful work, or fill with explicit NOP instruction

  j target                  F D X M W
  add x1, x2, x3              F D X M W
  target: xori x1, x1, 7        F D X M W

Post-1990 RISC ISAs don't have delay slots

Encodes microarchitectural detail into ISA
  c.f. IBM 650 drum layout
Performance issues
  Increased I-cache misses from NOPs in unused delay slots
  I-cache miss on delay slot causes machine to wait, even if delay slot is a NOP
Complicates more advanced microarchitectures
  Consider 30-stage pipeline with four-instruction-per-cycle issue
Better branch prediction reduced need
  Branch prediction in later lecture

RISC-V Conditional Branches

[Figure: Fetch, Decode, and EXecute stages with branch logic; labels include PC_fetch, PC_decode, PC_execute, +4, Add, Cond?, Branch?, Kill, FKill, DKill, and PCSel.]

Pipelining for Conditional Branches

  beq x1, x2, target        F D X M W
  bubble                      F D X M W
  bubble                        F D X M W
  target: add x1, x2, x3          F D X M W

Pipelining for Jump Register

Register value obtained in execute stage

  jr x1                     F D X M W
  bubble                      F D X M W
  bubble                        F D X M W
  target: add x5, x6, x7          F D X M W

Why instruction may not be dispatched every cycle in classic 5-stage pipeline (CPI>1)

Full bypassing may be too expensive to implement
  typically all frequently used paths are provided
  some infrequently used bypass paths may increase cycle time

  and counteract the benefit of reducing CPI
Loads have two-cycle latency
  Instruction after load cannot use load result
  MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
    MIPS: "Microprocessor without Interlocked Pipeline Stages"
Jumps/Conditional branches may cause bubbles
  kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler. NOPs reduce CPI, but increase instructions/program!

Traps and Interrupts

In class, we'll use following terminology
Exception: An unusual internal event caused by program during execution
  E.g., page fault, arithmetic underflow
Interrupt: An external event outside of running program
Trap: Forced transfer of control to supervisor caused by exception or interrupt
  Not all exceptions cause traps (c.f. IEEE 754 floating-point standard)

History of Exception Handling

Analytical Engine had overflow exceptions
First system with traps was Univac-I, 1951
  Arithmetic overflow would either
    1. trigger the execution of a two-instruction fix-up routine at address 0, or
    2. at the programmer's option, cause the computer to stop
  Later Univac 1103, 1955, modified to add external interrupts
    Used to gather real-time wind tunnel data
First system with I/O interrupts was DYSEAC, 1954
  Had two program counters, and I/O signal caused switch between two PCs
  Also, first system with DMA (Direct Memory Access by I/O device)
  And, first mobile computer!

DYSEAC, first mobile computer!

Carried in two tractor trailers, 12 tons + 8 tons
Built for US Army Signal Corps
[Courtesy Mark Smotherman]

Asynchronous Interrupts

An I/O device requests attention by asserting one of the prioritized interrupt request lines
When the processor decides to process the interrupt
  It stops the current program at instruction I_i, completing all the instructions up to I_{i-1} (precise interrupt)
  It saves the PC of instruction I_i in a special register (EPC)
  It disables interrupts and transfers control to a designated interrupt handler running in supervisor mode

Trap: altering the normal flow of control

[Figure: program instructions I_{i-1}, I_i, I_{i+1}, with control transferring to trap handler instructions HI_1, HI_2, ..., HI_n.]
An external or internal event that needs to be processed by another (system) program. The event is usually unexpected or rare from the program's point of view.

Trap Handler

Saves EPC before enabling interrupts to allow nested interrupts
  need an instruction to move EPC into GPRs
  need a way to mask further interrupts at least until EPC can be saved
Needs to read a status register that indicates the cause of the trap
Uses a special indirect jump instruction ERET (return-from-environment) which
  enables interrupts
  restores the processor to the user mode
  restores hardware status and control state

Synchronous Trap

A synchronous trap is caused by an exception on a particular instruction
In general, the instruction cannot be completed and needs to be restarted after the exception has been handled
  requires undoing the effect of one or more partially executed instructions
In the case of a system call trap, the instruction is considered to have been completed
  a special jump instruction involving a change to a privileged mode

Exception Handling 5-Stage Pipeline

[Figure: five-stage pipeline (PC, Inst. Mem, Decode, EXecute, Data Mem, Writeback) showing where exceptions arise: PC address exception at fetch, illegal opcode at decode, overflow at execute, data address exceptions at memory, plus asynchronous interrupts.]
How to handle multiple simultaneous exceptions in different pipeline stages?
How and where to handle external asynchronous interrupts?

Exception Handling 5-Stage Pipeline

[Figure: the same pipeline with exception flags (Exc D, Exc E, Exc M) and PCs (PC D, PC E, PC M) carried to the Commit Point; labels include Asynchronous Interrupts, Cause, EPC, Select Handler PC, Kill F Stage, Kill D Stage, Kill E Stage, and Kill Writeback.]

Exception Handling 5-Stage Pipeline

Hold exception flags in pipeline until commit point (M stage)
Exceptions in earlier pipe stages override later exceptions for a given instruction
Inject external interrupts at commit point (override others)
If trap at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage

Speculating on Exceptions

Prediction mechanism
  Exceptions are rare, so simply predicting no exceptions is very accurate!
Check prediction mechanism
  Exceptions detected at end of instruction execution pipeline, special hardware for various exception types
Recovery mechanism
  Only write architectural state at commit point, so can throw away partially executed instructions after exception
  Launch exception handler after flushing pipeline
Bypassing allows use of uncommitted instruction results by following instructions

Acknowledgements

This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  Arvind (MIT)
  Joel Emer (Intel/MIT)
  James Hoe (CMU)
  John Kubiatowicz (UCB)
  David Patterson (UCB)
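Worked example for the Exception Handling 5-Stage Pipeline slides: the commit-point rules (earlier-stage exceptions override later ones for a given instruction, external interrupts override all, and a trap updates Cause and EPC and redirects fetch) can be sketched as a small selection function. The names `select_trap` and `STAGE_ORDER`, the dictionary layout, and the handler address 0x80 are assumptions made for illustration, not part of the lecture.

```python
# Pipeline stages in which an exception flag can be raised, earliest first.
STAGE_ORDER = ("fetch", "decode", "execute", "memory")


def select_trap(pc, exceptions, interrupt_pending, handler_pc):
    """Decide at the commit point whether the committing instruction traps.

    exceptions: maps a stage name to the cause raised for this instruction
    in that stage, e.g. {"decode": "illegal opcode"}.
    Returns None if no trap is taken, otherwise the values to write.
    """
    if interrupt_pending:
        cause = "interrupt"  # injected at commit, overrides exceptions
    else:
        # For one instruction, the earliest pipeline stage wins.
        cause = next(
            (exceptions[s] for s in STAGE_ORDER if s in exceptions), None
        )
    if cause is None:
        return None
    return {
        "cause": cause,            # written to the Cause register
        "epc": pc,                 # PC of the trapping instruction
        "next_fetch_pc": handler_pc,
        "kill_all_stages": True,   # flush partially executed instructions
    }


flags = {"execute": "overflow", "decode": "illegal opcode"}
print(select_trap(0x200, flags, interrupt_pending=False, handler_pc=0x80))
print(select_trap(0x200, flags, interrupt_pending=True, handler_pc=0x80))
print(select_trap(0x204, {}, interrupt_pending=False, handler_pc=0x80))
```

Deferring the decision to a single commit point is what keeps traps precise in this scheme: no architectural state has been written by the trapping instruction or anything younger when the pipeline is flushed.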
