
Taxonomy


SIMT


GPGPU


GPGPU

  • OpenCL
  • CUDA (NVIDIA)
  • ROCm (AMD)
Feature     | CUDA (NVIDIA)                    | OpenCL (on NVIDIA)
Support     | NVIDIA GPUs only                 | NVIDIA, AMD, Intel, CPUs, FPGAs, etc.
Performance | Generally higher, more optimized | Good, but usually lower than CUDA

OpenCL


OpenCL is a standard for cross-platform parallel programming of diverse accelerators. As such it is not CUDA specific, but targets CPUs, GPUs, DSPs and FPGAs. The OpenCL API specification enables each chip to have its own OpenCL drivers tuned to its specific architecture. To be conformant to OpenCL, hardware vendors must become OpenCL Adopters and submit their conformance test results for review. NVIDIA is OpenCL conformant, which allows us to use the C API provided by OpenCL to call the NVIDIA runtime directly. Learning OpenCL therefore translates to other conformant platforms as well, since they all have to implement the same specification.

CUDA (NVIDIA)


Compute Unified Device Architecture (CUDA).

SM -> Warp -> Block -> Grid -> Kernel

Terminology


Term                          | Description
Streaming Multiprocessor (SM) | The fundamental processing unit of an NVIDIA GPU.
Warp                          | A group of 32 threads.
Warp scheduler                | A dedicated scheduler for optimizing the execution flow of warps.
Thread group                  |
CUDA core                     |
Tensor core                   |
Special function unit         |
Load/store unit               |

Streaming Multiprocessor (SM)


The Streaming Multiprocessor (SM) is the fundamental processing unit within NVIDIA GPUs (see Figure 2 for an overview of the SM in NVIDIA's H100 GPU). SMs are roughly analogous to CPU cores: both execute computations and store state. However, compared to CPU cores, SMs are simpler and weaker processors. Execution in SMs is pipelined within an instruction (as in almost all CPUs since the 1990s), but there is no speculative execution or instruction pointer prediction, unlike most modern CPUs.

GPU SMs can execute far more threads in parallel than CPUs. The H100 GPU can execute more than 16 000 threads at once (132 SMs x 4 warps x 32 threads = 16 896), while, e.g., the AMD EPYC 9965, a top-end server CPU, supports a total of 384 hardware threads. Most consumer-grade CPUs are far below even that thread count.

GPU SMs also support a large number of concurrent threads, i.e. threads whose execution can be interleaved. A single SM on the H100 can have up to 2048 threads resident, split across 64 thread groups of 32 threads each. That means the whole H100, with its 132 SMs, can keep more than 250k concurrent threads in flight (132 x 2048 = 270 336).
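
These limits differ between GPU generations. A minimal sketch (assuming the CUDA toolkit is installed) that queries them for the GPU in your own machine via the runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Name:                  %s\n", prop.name);
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}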

[Figure: overview of an NVIDIA H100 GPU SM]

Brief


  • The host is the CPU available in the system. The system memory associated with the CPU is called host memory.
  • The GPU is called a device and GPU memory likewise called device memory.

To execute any CUDA program, there are three main steps:

  • Copy the input data from host memory to device memory, also known as host-to-device transfer.
  • Load the GPU program and execute, caching data on-chip for performance.
  • Copy the results from device memory to host memory, also called device-to-host transfer.

Every CUDA kernel starts with the __global__ declaration specifier. Each thread gets a unique global ID, which programmers compute from built-in variables (threadIdx, blockIdx, blockDim).

A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2).
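
For example, with a one-dimensional grid, each thread derives its global index from the built-in variables (a sketch, not code from the original text; the kernel name is made up):

// Each thread handles one element; its global index is computed from
// blockIdx, blockDim and threadIdx.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // the last block may be partially filled
        data[i] = data[i] * factor;
}

// Launch enough blocks of 256 threads to cover all n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);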

Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism). One SM can run several concurrent CUDA blocks, depending on the resources the blocks need. Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at one time.

The CUDA programming model provides three key language extensions to programmers (see the sketch after this list):

  • CUDA blocks—A collection or group of threads.
  • Shared memory—Memory shared within a block among all threads.
  • Synchronization barriers—Enable multiple threads to wait until all threads have reached a particular point of execution before any thread continues.
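
A small sketch that uses all three extensions (hypothetical kernel, assuming a launch with 256 threads per block): each thread stages one element in shared memory, the block synchronizes at a barrier, and every thread then reads the element written by its neighbour.

__global__ void shift_left(const int* in, int* out)
{
    __shared__ int tile[256];          // shared memory, visible to all threads in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];         // each thread loads one element
    __syncthreads();                   // barrier: wait until the whole tile is written

    int next = (threadIdx.x + 1) % blockDim.x;
    out[i] = tile[next];               // safe: the neighbour's write happened before the barrier
}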

There are also many third-party tool-chains available:

  • PyCUDA — Use CUDA API operations from a Python interface.
  • OpenCL — Use low-level API operations to program CUDA GPUs.

Data flows through three stages of the heterogeneous architecture:

  1. Host side: CPU - bridge - CPU memory
  2. PCIe or NVLink bus
  3. Device side: GigaThread engine - interconnect - L2 cache - DRAM

Steps:

  1. Copy input data from CPU memory to GPU memory.
  2. Do the work, running CUDA kernels.
  3. Copy the results back from GPU memory to CPU memory.
__global__ void mykernel(void) {

}

__global__ indicates that a function:

  1. Runs on the device
  2. Is called from the host code (can also be called from other device code)

nvcc separates source code into host and device components:

  1. Device functions (e.g., mykernel()) are processed by the NVIDIA compiler.
  2. Host functions (e.g., main()) are processed by the standard host compiler.

We need a way to call this on the GPU. A kernel launch, marked by triple angle brackets, calls the device code:

mykernel<<<1,1>>>();

The parameters inside the brackets are the CUDA kernel execution configuration.

What about memory management? Host and device memory are separate entities. Device pointers point to GPU memory.

  1. Typically passed to device code.
  2. Typically not dereferenced in host code.

Host pointers point to CPU memory:

  1. Typically not passed to device code.
  2. Typically not dereferenced in device code.

(Special cases: Pinned pointers, ATS, managed memory)
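
As a sketch of the managed-memory special case (assuming a reasonably recent CUDA runtime), cudaMallocManaged returns a single pointer that is valid on both host and device, relaxing the rules above; the kernel below is hypothetical:

__global__ void increment(int* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int* x;
cudaMallocManaged(&x, n * sizeof(int));     // one pointer, usable from host and device code
increment<<<(n + 255) / 256, 256>>>(x, n);  // no explicit host-to-device copy needed
cudaDeviceSynchronize();                    // wait for the kernel before the host reads x
cudaFree(x);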

Simple CUDA API for handling device memory:

cudaMalloc(), cudaFree(), cudaMemcpy() // similar to C equivalents
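
Putting the three steps together, a minimal sketch (error handling omitted, N assumed to be a compile-time constant, and the empty mykernel from above standing in for real work):

int h_in[N], h_out[N];   // host buffers
int* d_buf;              // device pointer

cudaMalloc(&d_buf, N * sizeof(int));                                // allocate device memory
cudaMemcpy(d_buf, h_in, N * sizeof(int), cudaMemcpyHostToDevice);   // host-to-device transfer
mykernel<<<1, 1>>>();                                               // run device code
cudaMemcpy(h_out, d_buf, N * sizeof(int), cudaMemcpyDeviceToHost);  // device-to-host transfer
cudaFree(d_buf);                                                    // release device memory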

CUDA programming model


The CUDA programming model allows the programmer to define C++ functions, called kernels, which, when invoked, are executed N times in parallel by N different CUDA threads, where N is given by the execution configuration. A kernel is defined using the __global__ declaration specifier like so:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Using the <<<...>>> syntax we can specify the execution configuration. In the above example we specify that the kernel VecAdd should be executed on N different CUDA threads.
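
The execution configuration is not limited to a single dimension: grid and block sizes can also be given as dim3 values. A sketch (not from the original text) of a two-dimensional launch over a width x height matrix stored as a flat array:

// Each thread adds one (x, y) element.
__global__ void MatAdd(const float* A, const float* B, float* C, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        C[y * width + x] = A[y * width + x] + B[y * width + x];
}

// 16x16 threads per block, enough blocks to cover the whole matrix.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((width + 15) / 16, (height + 15) / 16);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C, width, height);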

Core


CUDA core


Tensor core


Units


Special function unit


Load/store unit


Warp


Memory


Memory coalescing


Drivers (archlinux)


Run

$ lspci -k -d ::03xx
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
	Subsystem: ASUSTeK Computer Inc. Device 8599
	Kernel driver in use: nouveau
	Kernel modules: nouveau, nvidia_drm, nvidia

Look up your card's code name:

https://nouveau.freedesktop.org/CodeNames.html

Install the correct package according to https://wiki.archlinux.org/title/NVIDIA#Installation

Restart the machine and check that the driver is working:

nvidia-smi

Install cuda (also see https://docs.nvidia.com/deploy/cuda-compatibility/)

pacman -S cuda

Verify installation

nvcc --version

Relevant packages:

  • nvidia
  • nvidia-utils: provides nvidia-libgl, opengl-driver, vulkan-driver
  • cuda
  • opencl-nvidia

Interacting with CUDA


There exist several wrappers around the CUDA API.

  1. PyCUDA: https://github.com/inducer/pycuda
    Lets you access NVIDIA's CUDA parallel computation API from Python.
  2. Official CUDA API bindings: https://github.com/NVIDIA/cuda-python
    NVIDIA's own language bindings for the CUDA API written in Python/Cython. This is similar to (1) in abstraction level.
  3. Numba: https://github.com/numba/numba
    NumPy-aware optimizing compiler for Python; it uses the LLVM compiler project to generate machine code directly from Python syntax. Numba provides a higher-level interface and uses cuda-python (2) under the hood to accelerate computations when compiling for CUDA.

Numba


Numba is an open source, NumPy-aware optimizing compiler for Python sponsored by Anaconda, Inc. It uses the LLVM compiler project to generate machine code from Python syntax. Numba can compile a large subset of numerically-focused Python, including many NumPy functions. Additionally, Numba has support for automatic parallelization of loops, generation of GPU-accelerated code, and creation of ufuncs and C callbacks.

ROCm (AMD)


ROCm (Radeon Open Compute) is AMD's open-source parallel computing architecture and framework.

HIP


HIP (Heterogeneous-computing Interface for Portability) is AMD's dedicated GPU programming environment.

Software stacks


  • Tinygrad
  • PyTorch
  • JAX

Tinygrad


Tinygrad takes advantage of the fact that every tensor operation is either elementwise or a reduction. The advantage is that such a small set of primitives builds up an abstraction that makes optimization easier (think of this as RISC versus a Complex Instruction Set Computer, CISC).
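
To make the elementwise/reduction split concrete, here is a CUDA sketch (hypothetical kernels, not tinygrad code): the elementwise kernel maps one thread to one output element, while the reduction collapses many inputs into one value per block using a shared-memory tree.

// Elementwise: out[i] depends only on a[i] and b[i].
__global__ void mul(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Reduction: each block sums 256 inputs down to a single partial sum.
__global__ void block_sum(const float* in, float* partial, int n)
{
    __shared__ float tile[256];                 // assumes 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];          // one partial sum per block
}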

PyTorch


JAX


https://github.com/jax-ml/jax

JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning.

JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via jax.grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

JAX uses XLA to compile and scale your NumPy programs on TPUs, GPUs, and other hardware accelerators.

Tensor abstraction


A Tensor is a multi-dimensional matrix containing elements of a single data type.

Define


  • SM
  • CUDA programming model
  • GPUs are 1024 bit SIMD machines?

Resources


Top


Other