COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface

Chapter 2
Instructions: Language of the Computer

Effect of Language and Algorithm

[Figure: three bar charts over C/none, C/O1, C/O2, C/O3, Java/int, and Java/JIT. Panels: Bubblesort Relative Performance; Quicksort Relative Performance; Quicksort vs. Bubblesort Speedup.]

Lessons Learnt

- Instruction count and CPI are not good performance indicators in isolation
- Compiler optimizations are sensitive to the algorithm
- Java/JIT compiled code is significantly faster than JVM interpreted
  - Comparable to optimized C in some cases
- Nothing can fix a dumb algorithm!

Arrays vs. Pointers

- Array indexing involves
  - Multiplying index by element size
  - Adding to array base address
- Pointers correspond directly to memory addresses
  - Can avoid indexing complexity

Example: Clearing an Array

    void clear1(long long int array[], int size) {
      int i;
      for (i = 0; i < size; i += 1)
        array[i] = 0;
    }

           MOV  X9,XZR        // i = 0
    loop1: LSL  X10,X9,#3     // X10 = i * 8
           ADD  X11,X0,X10    // X11 = address of array[i]
           STUR XZR,[X11,#0]  // array[i] = 0
           ADDI X9,X9,#1      // i = i + 1
           CMP  X9,X1         // compare i to size
           B.LT loop1         // if (i < size) go to loop1
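The index calculation that loop1 performs on every iteration (multiply the index by the element size, then add the array base address) can be sketched in C as follows. This is an illustrative sketch, not part of the original slides: the helper names `element_address` and `clear1_explicit` are hypothetical, and it assumes 8-byte elements (`long long`), matching the `i * 8` shift in the assembly.

```c
#include <stddef.h>

/* Hypothetical helper: address of array[i], computed the way loop1 does it. */
long long *element_address(long long *array, size_t i)
{
    char *base = (char *)array;              /* array base address, in bytes */
    size_t offset = i * sizeof(long long);   /* multiply index by element size */
    return (long long *)(base + offset);     /* add offset to base address */
}

/* clear1 with the address arithmetic written out explicitly. */
void clear1_explicit(long long *array, size_t size)
{
    for (size_t i = 0; i < size; i += 1) {
        long long *slot = element_address(array, i);
        *slot = 0;                           /* array[i] = 0 */
    }
}
```

Calling `clear1_explicit(a, 4)` on a four-element array zeroes every element. In ordinary C the expression `array[i]` already implies this arithmetic, so the explicit form is only useful for seeing what the LSL and ADD instructions compute.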
    void clear2(long long int *array, int size) {
      long long int *p;
      for (p = &array[0]; p < &array[size]; p = p + 1)
        *p = 0;
    }

           MOV  X9,X0         // p = address of array[0]
           LSL  X10,X1,#3     // X10 = size * 8
           ADD  X11,X0,X10    // X11 = address of array[size]
    loop2: STUR XZR,[X9,#0]   // Memory[p] = 0
           ADDI X9,X9,#8      // p = p + 8
           CMP  X9,X11        // compare p to &array[size]
           B.LT loop2         // if (p < &array[size]) go to loop2

Comparison of Array vs. Ptr

- Multiply "strength reduced" to shift (compiler optimization)
- Array version requires shift to be inside loop
  - Part of index calculation for incremented i
  - c.f. incrementing pointer
- Compiler can achieve same effect as manual use of pointers
  - Induction variable elimination (eliminating array address calculations within loops)
  - Better to make program clearer and safer

ARM & MIPS Similarities
- ARM: the most popular embedded core
- Similar basic set of instructions to MIPS

                           ARM              MIPS
    Date announced         1985             1985
    Instruction size       32 bits          32 bits
    Address space          32-bit flat      32-bit flat
    Data alignment         Aligned          Aligned
    Data addressing modes  9                3
    Registers              15 × 32-bit      31 × 32-bit
    Input/output           Memory mapped    Memory mapped

Instruction Encoding

[Figure not recovered]

The Intel x86 ISA
- Evolution with backward compatibility
  - 8080 (1974): 8-bit microprocessor
    - Accumulator, plus 3 index-register pairs
  - 8086 (1978): 16-bit extension to 8080
    - Complex instruction set (CISC)
  - 8087 (1980): floating-point coprocessor
    - Adds FP instructions and register stack
  - 80286 (1982): 24-bit addresses, MMU
    - Segmented memory mapping and protection
  - 80386 (1985): 32-bit extension (now IA-32)
    - Additional addressing modes and operations
    - Paged memory mapping as well as segments
- Further evolution…
  - i486 (1989): pipelined, on-chip caches and FPU
    - Compatible competitors: AMD, Cyrix, …
  - Pentium (1993): superscalar, 64-bit datapath
    - Later versions added MMX (Multi-Media eXtension) instructions
    - The infamous FDIV bug
  - Pentium Pro (1995), Pentium II (1997)
    - New microarchitecture (see Colwell, The Pentium Chronicles)
  - Pentium III (1999)
    - Added SSE (Streaming SIMD Extensions) and associated registers
  - Pentium 4 (2001)
    - New microarchitecture
    - Added SSE2 instructions
- And further…
  - AMD64 (2003): extended architecture to 64 bits
  - EM64T, Extended Memory 64 Technology (2004)
    - AMD64 adopted by Intel (with refinements)
    - Added SSE3 instructions
  - Intel Core (2006)
    - Added SSE4 instructions, virtual machine support
  - AMD64 (announced 2007): SSE5 instructions
    - Intel declined to follow, instead…
  - Advanced Vector Extension (announced 2008)
    - Longer SSE registers, more instructions
- If Intel didn't extend with compatibility, its competitors would!
  - Technical elegance ≠ market success

Basic x86 Registers

[Figure not recovered]

Basic x86 Addressing Modes
- Two operands per instruction

    Source/dest operand    Second source operand
    Register               Register
    Register               Immediate
    Register               Memory
    Memory                 Register
    Memory                 Immediate

- Memory addressing modes
  - Address in register
  - Address = Rbase + displacement
  - Address = Rbase + 2^scale × Rindex (scale = 0, 1, 2, or 3)
  - Address = Rbase + 2^scale × Rindex + displacement

X86 Integer operations

- The 8086 provides support for both 8-bit (byte) and 16-bit (word) data types.
- The 80386 adds 32-bit addresses and data (doublewords) in the x86.
- The x86 integer operations can be divided into four major classes:
  1. Data movement instructions, including move, push, and pop.
  2. Arithmetic and logic instructions, including test, integer, and decimal arithmetic operations.
  3. Control flow, including conditional branches, unconditional branches, calls, and returns.
  4. String instructions, including string move and string compare.
- Some typical x86 instructions and their functions:

[Table not recovered]
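The most general form on the Basic x86 Addressing Modes slide, Address = Rbase + 2^scale × Rindex + displacement, can be worked through with a small C sketch. The function `effective_address` is a hypothetical helper written for illustration, not an x86 or compiler API, and it assumes the 32-bit addresses of the 80386.

```c
#include <stdint.h>

/* Hypothetical helper: Address = Rbase + 2^scale * Rindex + displacement.
   Arithmetic is unsigned 32-bit, so it wraps around like a 32-bit address. */
uint32_t effective_address(uint32_t rbase, uint32_t rindex,
                           unsigned scale, int32_t displacement)
{
    scale &= 3u;                               /* scale is limited to 0, 1, 2, or 3 */
    uint32_t scaled_index = rindex << scale;   /* index times 1, 2, 4, or 8 */
    return rbase + scaled_index + (uint32_t)displacement;
}
```

For example, with a base of 0x1000, an index of 3, a scale of 2 (4-byte elements), and a displacement of 8, the sketch yields 0x1000 + 12 + 8 = 0x1014. Setting the scale and displacement to zero reduces it to the simpler modes in the list.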
- Conditional branches on the x86 are based on condition codes or flags, like ARMv7.
- Some of the integer x86 instructions:

[Table not recovered]

x86 Instruction Encoding

- Variable length encoding (1 to 15 bytes)
  - Postfix bytes specify addressing mode
  - Prefix bytes modify operation
    - Operand length, repetition, locking, …
- The opcode may include the addressing mode and the register
- A postbyte, labeled "mod, reg, r/m", contains the addressing mode information.

Implementing IA-32

- Complex instruction set makes implementation difficult
  - Hardware translates instructions to simpler microoperations
    - Simple instructions: 1-1
    - Complex instructions: 1-many
  - Microengine similar to RISC
  - Market share makes this economically viable
- Comparable performance to RISC
  - Compilers avoid complex instructions

Fallacies

- Powerful instruction ⇒ higher performance
  - Fewer instructions required
  - But complex instructions are hard to implement
    - May slow down all instructions, including simple ones
  - Compilers are good at making fast code from simple instructions
- Use assembly code for high performance
  - But modern compilers are better at dealing with modern processors
  - More lines of code ⇒ more errors and less productivity
  - Dangers of writing in assembly language are the protracted time spent coding and debugging, the loss in portability, and the difficulty of maintaining such code.
- Backward compatibility ⇒ instruction set doesn't change
  - But they do create more instructions

[Figure: x86 instruction set (not recovered)]

Pitfalls

- Sequential word or doubleword addresses in machines with byte addressing do not differ by one
  - Increment by 4 (words) or 8 (doublewords), not by 1!
- Keeping a pointer to an automatic variable after procedure returns
  - e.g., passing pointer back via an argument
  - Pointer becomes invalid when stack popped
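Both pitfalls can be illustrated with a short C sketch. The function names here (`byte_stride`, `reset_counter`) are hypothetical and chosen only for this example. The broken pattern is shown in a comment rather than as live code, because using such a pointer is undefined behavior.

```c
#include <stddef.h>

/* Pitfall 1: adjacent elements are sizeof(element) bytes apart, not 1.
   Returns the byte distance between two consecutive int elements. */
size_t byte_stride(void)
{
    int words[2] = {0, 0};
    return (size_t)((char *)&words[1] - (char *)&words[0]);
}

/* Pitfall 2 (do not do this): passing a pointer to an automatic
   variable back via an argument. After the function returns, its
   stack frame is popped and *out points at dead storage.

   void broken(int **out) {
       int local = 0;
       *out = &local;
   }
*/

/* Safer sketch: the caller owns the storage and passes its address in,
   so the object outlives the call. */
void reset_counter(int *count)
{
    *count = 0;
}
```

On a machine where `int` is 4 bytes, `byte_stride()` returns 4, which matches the "increment by 4" advice for words. With `reset_counter`, the caller declares `int c;` and passes `&c`, so the pointer never outlives the variable it refers to.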
Concluding Remarks

- Design principles
  1. Simplicity favors regularity
  2. Smaller is faster (ARMv8 has 32 registers)
  3. Make the common case fast
     - PC-relative addressing for conditional branches and immediate addressing for larger constant operands
  4. Good design demands good compromises
     - Compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length
- Layers of software/hardware
  - Compiler, assembler, hardware
- LEGv8: typical of RISC ISAs
  - c.f. x86
- Additional ARMv8 features:
  - Flexible second operand
  - Additional addressing modes
  - Conditional instructions (e.g. CSET, CINC)
- Each category of ARMv8 instructions is associated with constructs that appear in programming languages
- The popularity of each class of instructions for SPEC CPU2006 is shown below

[Table not recovered]
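As an illustration of the conditional instructions named among the additional ARMv8 features, the results that CSET and CINC produce after a signed compare can be written as C expressions. This is a sketch of the semantics only: the function names are hypothetical, and whether a given compiler actually emits CSET or CINC for this code is an assumption that depends on the compiler and its optimization settings.

```c
#include <stdint.h>

/* Models CMP a, b followed by CSET Rd, LT:
   Rd = 1 if a < b (signed), otherwise 0. */
uint64_t cset_lt(int64_t a, int64_t b)
{
    return (a < b) ? 1u : 0u;
}

/* Models CMP a, b followed by CINC Rd, Rn, LT:
   Rd = Rn + 1 if a < b (signed), otherwise Rn. */
uint64_t cinc_lt(uint64_t rn, int64_t a, int64_t b)
{
    return (a < b) ? rn + 1u : rn;
}
```

Both functions have no branch in their source-level data flow, which is the construct these instructions are designed to serve: a condition turned directly into a value.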