CUDA - Louisiana Tech University


Intermediate GPGPU Programming in CUDA
Supada Laosooksathit

NVIDIA Hardware Architecture
[figure: host memory and device memory]

Recall: 5 Steps for CUDA Programming
1. Initialize device
2. Allocate device memory
3. Copy data to device memory
4. Execute kernel
5. Copy data back from device memory

Initialize Device Calls
To select the device associated with the host thread: cudaSetDevice(device)
  This function must be called before any __global__ function launch; otherwise device 0 is automatically selected.
To get the number of devices: cudaGetDeviceCount(&deviceCount)
To retrieve a device's properties: cudaGetDeviceProperties(&deviceProp, device)
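The three calls above can be combined into a short initialization sketch (compiled with nvcc; error checking omitted for brevity):

```cuda
#include <stdio.h>

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);      // number of CUDA-capable devices
    printf("Found %d device(s)\n", deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    cudaSetDevice(0);                      // select device 0 for this host thread
    return 0;
}
```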

Hello World Example
Allocate host and device memory

Hello World Example
Host code

Hello World Example
Kernel code

To Try CUDA Programming
SSH to 138.47.102.111
Set environment variables in .bashrc in your home directory:
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
Compile the following directories:
  NVIDIA_GPU_Computing_SDK/shared/
  NVIDIA_GPU_Computing_SDK/C/common/
The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Demo: Hello World
Print out block and thread IDs
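A minimal version of this demo might look like the sketch below. Note that device-side printf requires compute capability 2.0 or higher; on older devices the SDK's cuPrintf helper was used instead.

```cuda
#include <stdio.h>

// Each thread prints its own block and thread IDs.
__global__ void hello(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the kernel and flush its printf output
    return 0;
}
```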

Demo: Vector Add
C = A + B

NVIDIA Hardware Architecture
[figure: streaming multiprocessor (SM)]
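Returning to the vector-add demo above, a minimal sketch (kernel name and sizes are illustrative; error checking omitted) walks through all five CUDA programming steps:

```cuda
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) C[i] = A[i] + B[i];                    // guard against overrun
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes),
          *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up to cover all elements
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[10] = %f\n", hC[10]);   // 10 + 20

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```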

Specifications of a Device

Specification      Compute Capability 1.3   Compute Capability 2.0
Warp size          32                       32
Max threads/block  512                      1024
Max blocks/grid    65535                    65535
Shared memory      16 KB/SM                 48 KB/SM

For more details:
  deviceQuery in the CUDA SDK
  Appendix F in the Programming Guide 4.0

Demo: deviceQuery
Shows hardware specifications in detail

Memory Optimizations
Reduce the time of memory transfer between host and device
  Use asynchronous memory transfer (CUDA streams)
  Use zero copy
Reduce the number of transactions between on-chip and off-chip memory
  Memory coalescing
  Avoid bank conflicts in shared memory

Reduce Time of Host-Device Memory Transfer
Regular memory transfer is synchronous

Reduce Time of Host-Device Memory Transfer
CUDA streams allow overlapping between kernel execution and memory copy

CUDA Streams Example
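A minimal sketch of the streams pattern, splitting the work in two so the copy of one half can overlap the kernel on the other (kernel and sizes are illustrative; async copies require pinned host memory):

```cuda
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20, half = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // page-locked host buffer
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < 2; ++k) {
        int off = k * half;
        // copy, kernel, and copy-back for this half are queued in stream k;
        // the two streams can overlap with each other
        cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();   // wait for both streams to finish

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```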

CUDA Streams Example

GPU Timers
CUDA events
  An API
  Uses the clock on the GPU
  Accurate for timing kernel executions
CUDA timer calls
  Libraries implemented in the CUDA SDK

CUDA Events Example
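The event API above is typically used as in this fragment (kernel, blocks, and threads are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);          // record in the default stream
kernel<<<blocks, threads>>>(...);   // work to be timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);         // wait until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```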

Demo: simpleStreams

Reduce Time of Host-Device Memory Transfer
Zero copy
  Allows device pointers to access page-locked host memory directly
  Page-locked host memory is allocated by cudaHostAlloc()

Demo: Zero copy
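The zero-copy pattern above might look like this minimal sketch (kernel name is illustrative; the kernel dereferences the mapped host buffer directly, so no explicit cudaMemcpy is needed):

```cuda
#include <stdio.h>

__global__ void increment(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;   // accesses go over the bus to host memory
}

int main(void) {
    const int n = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped pinned memory

    float *h, *d;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaHostGetDevicePointer(&d, h, 0);      // device alias for the host buffer

    increment<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();                 // results are already in host memory

    printf("h[5] = %f\n", h[5]);             // 5 + 1
    cudaFreeHost(h);
    return 0;
}
```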

Reduce Number of On-chip and Off-chip Memory Transactions
Threads in a warp access global memory
Memory coalescing: copy a bunch of words at the same time

Memory Coalescing
Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)
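A sketch contrasting the two access patterns (kernel names are illustrative): in the first, consecutive threads read consecutive words, so a warp's loads coalesce into few transactions; in the second, a large stride scatters the warp's addresses across segments.

```cuda
// Coalesced: thread i reads element i, so threads 0..31 of a warp
// touch 32 consecutive 4-byte words in one (or few) transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so the warp's loads fall in different segments and cannot be combined.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```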

Memory Coalescing
Memory addresses are aligned in the same segment, but the accesses are not sequential

Memory Coalescing
Memory addresses are not aligned in the same segment

Shared Memory
16 banks for compute capability 1.x; 32 banks for compute capability 2.x
Helps in utilizing memory coalescing
Bank conflicts may occur when two or more threads access the same bank
  In compute capability 1.x, no broadcast
  In compute capability 2.x, the same data is broadcast to the threads that request it

Bank Conflicts
No bank conflict

[figure: thread-to-bank mappings — in the conflict-free case threads 0-3 read from distinct banks 0-3; in the 2-way conflict case two threads read from the same bank]
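A classic place this matters is a shared-memory matrix transpose. In the sketch below (assuming a 32x32 thread block, compute capability 2.0 or higher, and width a multiple of TILE), padding each row of the tile by one element shifts the rows across banks, so reading a column no longer maps every thread of a warp to the same bank:

```cuda
#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // transposed block
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free
}
```

Without the `+ 1` padding, `tile[threadIdx.x][threadIdx.y]` would make all 32 threads of a warp hit the same bank.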

Matrix Multiplication Example

Matrix Multiplication Example
Using shared memory reduces accesses to global memory:
  A is read from global memory (B.width/BLOCK_SIZE) times
  B is read from global memory (A.height/BLOCK_SIZE) times
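The shared-memory version can be sketched as below (kernel name is illustrative; square matrices of size n, assumed a multiple of BLOCK_SIZE, launched with BLOCK_SIZE x BLOCK_SIZE thread blocks). Each block stages tiles of A and B in shared memory, so each element is fetched from global memory once per tile instead of once per output element:

```cuda
#define BLOCK_SIZE 16

__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // each thread loads one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                       // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with this tile
    }
    C[row * n + col] = sum;
}
```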

Demo: Matrix Multiplication
With and without shared memory
Different block sizes

Control Flow
if, switch, do, for, while
Branch divergence in a warp
  Threads in a warp issue different instruction sets
  Different execution paths are serialized
  This increases the number of instructions executed in that warp

Branch Divergence
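A sketch of the difference (kernel names are illustrative): in the first kernel, even and odd threads of the same warp take different paths, so the warp executes both branches serially; in the second, the condition is uniform across each 32-thread warp, so no warp executes both paths.

```cuda
// Divergent: threads within one warp split on the condition.
__global__ void divergent(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // even/odd threads of the SAME warp
        d[i] *= 2.0f;
    else
        d[i] += 1.0f;
}

// Warp-uniform: whole warps branch together, no divergence.
__global__ void uniform(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // condition constant per warp
        d[i] *= 2.0f;
    else
        d[i] += 1.0f;
}
```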

Summary
5 steps for CUDA programming
NVIDIA Hardware Architecture
  Memory hierarchy: global memory, shared memory, register file
  Specifications of a device: block, warp, thread, SM

Summary
Memory optimization
  Reduce overhead due to host-device memory transfer with CUDA streams and zero copy
  Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  Try to avoid bank conflicts in shared memory
Control flow
  Try to avoid branch divergence in a warp

References
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
http://www.developer.nvidia.com/cuda-toolkit
