Dive into the basics of GPUs, CUDA & accelerated programming using Numba in Python

Sahil Chachra
9 min read · Feb 8, 2022

In this blog, I will talk about the basics of GPUs, CUDA and Numba. I will also briefly discuss how using Numba makes a noticeable difference in day-to-day code, both on the CPU and the GPU.

This blog is not a Numba tutorial. To learn Numba, follow this Kaggle Notebook (it is where I learnt it myself). This blog is more focused on the underlying theory.

The main motive behind the blog: deep learning practitioners often talk about a GPU's capability and which GPU to use for a given use case and workload. Being in this field, it is important to understand the components of the GPU and CUDA, as we use both of them every single day!

I have attached all the resources I referred to while learning the topics below. Feel free to explore those links too!

Table of Contents

  1. Brief intro to Numba
  2. What is CUDA & why use it?
  3. CUDA Kernels
  4. Threads, Blocks, Grids and Warps in CUDA
  5. Streaming Multiprocessors
  6. SIMT Architecture
  7. Coalesced Memory Access
  8. Bank conflicts
  9. Numba on CPU and GPU (with CUDA)

1. Brief intro to Numba

Numba is a just-in-time compiler for Python (just-in-time means the code is compiled at runtime instead of ahead of time), best suited for functions involving NumPy arrays and their calculations, lengthy mathematical operations and loops.

Numba works only at the level of functions, i.e., it does not optimize an entire Python application or arbitrary pieces of code inside a function. It is not a replacement for Python's interpreter, but it helps speed up execution. Numba accelerates code by assigning concrete data types to the variables we use. In Python, we don't declare int or float; the interpreter resolves types at runtime, which makes it slow. Numba instead infers and fixes data types such as int32 or float32 ahead of time and compiles the function to machine code.

However, Numba cannot optimize all the code we write, since it does not support certain data types. Refer to the documentation (link in References).
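As a minimal sketch of the typical pattern (the function name and array are just for illustration, using Numba's standard @njit decorator):

from numba import njit
import numpy as np

@njit  # compile this function to machine code on its first call
def sum_of_squares(arr):
    total = 0.0
    for i in range(arr.shape[0]):  # explicit loops are fine under Numba
        total += arr[i] * arr[i]
    return total

x = np.random.rand(1_000_000)
sum_of_squares(x)         # first call triggers compilation
print(sum_of_squares(x))  # later calls reuse the compiled machine code

The first call includes compilation time; subsequent calls run the cached compiled version.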

Alternatives to Numba are PyCUDA and CUDA in C/C++. One of the main advantages of Numba is that it also accelerates code on the CPU, whereas the other two are specific to Nvidia GPUs.

2. What is CUDA & why use it?

CUDA stands for Compute Unified Device Architecture. It is an API and a parallel computing platform, and it is specific to Nvidia's GPUs.

Wait, what is a parallel computing platform? It is a type of computing architecture where multiple processors simultaneously execute many calculations that a large, complex problem has been broken down into.

GPUs, in general, are used where parallel computation is required.

What are CUDA cores? CUDA cores are simply the cores in Nvidia's GPUs (Nvidia -> CUDA). They are specialized for parallel computing and perform the actual operations/calculations. The greater the number of cores, the faster the computation!

Why do we use CUDA? Some complex calculations can be broken down into smaller sub-problems, where each sub-problem is independent of the others' results; this is where parallel computing comes into the picture. To compute such a huge number of sub-problems quickly and in parallel, you bring in GPUs, which have far more cores than a CPU. Each sub-problem can then be treated as an individual task computed by a core of the GPU. Since you depend on the GPU to complete the task at lightning speed, your code also has to be in the form the GPU expects. How do you write an operation in a way the GPU expects? CUDA is Nvidia's API and programming model, which lets your code (say a convolution operation written in Python or C++) run efficiently on Nvidia's own GPUs. To use CUDA in our code, we install the CUDA toolkit.

3. CUDA Kernels

A function that is supposed to run on the GPU is called a kernel. To designate a function as a kernel, we use function qualifiers, which are similar to decorators in Python and are placed just before the function.

A few function qualifiers (in C/C++) are:

__global__ - a function marked with this qualifier becomes a kernel. The kernel can be called from the host and runs on the device.
__device__ - a function marked with this qualifier can only be called from the device and runs on the device (device = GPU, host = CPU).
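Numba's rough equivalents of these qualifiers (a small sketch; the function names are just placeholders) are @cuda.jit for a __global__ kernel and @cuda.jit(device=True) for a __device__ function:

from numba import cuda

@cuda.jit(device=True)   # like __device__: callable only from GPU code
def square(x):
    return x * x

@cuda.jit                # like __global__: a kernel, launched from the host
def square_kernel(arr, out):
    idx = cuda.grid(1)
    if idx < arr.size:   # guard against threads beyond the array bounds
        out[idx] = square(arr[idx])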

What does a simple kernel look like with Numba? Following is an example where we multiply numbers by 2:

from numba import cuda
import numpy as np

@cuda.jit  # decorator to compile this function as a CUDA kernel
def multiply_kernel(x, out):
    idx = cuda.grid(1)     # absolute index of this thread in the 1-D grid
    out[idx] = x[idx] * 2  # each thread computes one element

n = 4096
x = np.arange(n).astype(np.int32)
d_x = cuda.to_device(x)              # copy input to GPU memory
d_out = cuda.device_array_like(d_x)  # allocate output array on the GPU

blocks_per_grid = 32     # number of blocks in the grid
threads_per_block = 128  # number of threads in each block (32 * 128 = 4096 = n)

multiply_kernel[blocks_per_grid, threads_per_block](d_x, d_out)
cuda.synchronize()           # wait for the GPU to finish
print(d_out.copy_to_host())  # copy the result from GPU to host

Output will be something like this: [0, 2, 4, 6, … , 8190]

4. Threads, Blocks, Grids and Warps in CUDA

Threads — a thread is the single execution unit that runs your kernel.

Blocks — several threads together form a block. There can be at most 1024 threads in each block.

A GPU can run a certain number of blocks in parallel. A very good GPU might run 10 blocks concurrently and complete your task in 10 minutes, while an average GPU might run only 2 blocks concurrently and take 20 minutes. The maximum number of threads in each block remains the same (1024), which means the same code can run on different GPUs without a single change. Thanks to Nvidia!

Grid — several blocks together form a grid.

Warp — to perform any task, threads require resources. Streaming Multiprocessors don't assign resources to threads individually. Instead, they divide threads into groups of 32 (at most; a warp can be partially filled) called warps and then assign resources per warp. You can now say that a block is made up of many warps.

Half-warp — on architectures before Fermi, only 16 threads issue a memory access on the SM at any given time, so the half-warp is treated as the unit of memory access.
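To make the thread/block/grid indexing concrete, the cuda.grid(1) call used earlier is just shorthand for combining the thread and block indices yourself (a minimal sketch; the kernel name is illustrative):

from numba import cuda

@cuda.jit
def manual_index_kernel(out):
    # equivalent to idx = cuda.grid(1)
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < out.size:
        out[idx] = idx  # record which global thread index touched this element

The total number of threads launched is blocks_per_grid * threads_per_block, which is why the two launch parameters together must cover the whole array.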

[Diagram of the thread/block/grid hierarchy, referred from ResearchGate]

5. Streaming Multiprocessors

Our kernels are executed by SMs, or Streaming Multiprocessors. A GPU consists of several of them. They run at a lower clock rate than CPU cores and have a small cache. An SM's primary task is to execute several thread blocks in parallel. When one of its thread blocks has completed execution, the SM picks up the next thread block.

From Nvidia’s documentation — When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

Each SM has several caches (L1 cache, constant cache, texture cache), shared memory, warp schedulers and execution cores for floating-point and integer operations. The shared memory of an SM is divided between the thread blocks resident on it. If shared memory is not accessed properly by the threads, it leads to 'bank conflicts', where accesses queue behind each other and performance drops. One SM can execute more than one block at any given time.
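Numba exposes an SM's shared memory through cuda.shared.array. Below is a minimal sketch, assuming 128 threads per block and an input length that is a multiple of the block size; the kernel name is illustrative:

from numba import cuda, float32

TPB = 128  # threads per block, assumed for this sketch

@cuda.jit
def reverse_in_block(arr, out):
    # each block stages its slice of the input in the SM's fast shared memory
    tile = cuda.shared.array(shape=TPB, dtype=float32)
    tx = cuda.threadIdx.x
    idx = cuda.grid(1)
    if idx < arr.size:
        tile[tx] = arr[idx]
    cuda.syncthreads()  # barrier: wait until the whole block has filled the tile
    if idx < arr.size:
        out[idx] = tile[TPB - 1 - tx]  # read back within the block in reversed order

cuda.syncthreads() acts as a barrier so that no thread reads the tile before every thread in the block has written its element.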

6. SIMT Architecture

I found the best explanation in Nvidia's own documentation, so I am adding it here. As mentioned in Nvidia's CUDA documentation (see reference 4, quoting from the section on Hardware Implementation):

“A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large amount of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread) that is described in SIMT Architecture. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike CPU cores, they are issued in order and there is no branch prediction or speculative execution.”

7. Coalesced Memory Access

When a set of threads reads from or writes to a contiguous block of global memory so that the hardware can combine the accesses into a single transaction, it is known as coalesced memory access. The strict coalescing rules usually described apply to GPUs with CUDA compute capability 1.x and 2.x; newer GPUs use more sophisticated mechanisms to access global memory, but contiguous access remains the fast path.
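As a rough sketch of the idea (the kernel names and the stride parameter are purely illustrative), compare an access pattern where neighbouring threads touch neighbouring elements with one where they do not:

from numba import cuda

@cuda.jit
def coalesced_copy(src, dst):
    idx = cuda.grid(1)
    if idx < src.size:
        dst[idx] = src[idx]  # neighbouring threads touch neighbouring addresses

@cuda.jit
def strided_copy(src, dst, stride):
    # only meant to illustrate the access pattern:
    # neighbouring threads touch far-apart addresses, defeating coalescing
    idx = cuda.grid(1)
    if idx < src.size:
        j = (idx * stride) % src.size
        dst[j] = src[j]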

8. Bank conflicts

What are banks? Shared memory that can be accessed in parallel is divided into equally sized modules called banks.

In shared memory, 4 bytes form 1 word. There are 32 banks (on modern cards) in an Nvidia GPU's shared memory, and successive words map to successive banks: word 0 belongs to bank 0, word 1 to bank 1, and so on, wrapping around after bank 31. A warp can send requests to any permutation of these 32 banks in a single access. However, if two or more threads of a warp access different addresses that fall in the same bank, the accesses are serialized, and this is a bank conflict.

If all the threads in a warp want to read the same word from a single bank, the value is read only once and shared among the threads. This is called a broadcast, and it does not cause a conflict.
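A common illustration of avoiding bank conflicts is padding a shared-memory tile by one column, so that threads of a warp reading "down" a column land in different banks. The sketch below assumes float32 data, a 2-D launch with (32, 32) threads per block, and illustrative names:

from numba import cuda, float32

TILE = 32    # tile width, matching the 32 shared-memory banks
PADDED = 33  # TILE + 1: the extra column skews each row across different banks

@cuda.jit
def transpose_tile(inp, out):
    # stage a TILE x TILE block of the input in shared memory; without the
    # padding column, reading a column of the tile would hit a single bank
    tile = cuda.shared.array(shape=(TILE, PADDED), dtype=float32)
    x = cuda.threadIdx.x
    y = cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + y
    col = cuda.blockIdx.x * TILE + x
    if row < inp.shape[0] and col < inp.shape[1]:
        tile[y, x] = inp[row, col]
    cuda.syncthreads()
    # write the transposed tile back to global memory
    t_row = cuda.blockIdx.x * TILE + y
    t_col = cuda.blockIdx.y * TILE + x
    if t_row < out.shape[0] and t_col < out.shape[1]:
        out[t_row, t_col] = tile[x, y]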

9. Numba on CPU and GPU

On CPU

While estimating the value of pi using the Monte Carlo method, we pass 10k as the number of samples. On the CPU without Numba it took 10.8 ms (mean of 7 runs, 100 loops each), whereas with Numba it took 237 μs (mean of 7 runs, 100 loops each). That is about 45 times faster!

import random
from numba import jit

@jit(nopython=True)  # compile to machine code, with no fallback to object mode
def monte_carlo_pi(n):
    acc = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / n

%timeit monte_carlo_pi(10000)

This example code is from an Analytics Vidhya Numba tutorial. To accelerate your code with Numba, it's best to refer to the documentation and restructure your functions so that each one performs only the heavy numerical calculations. Because Numba supports a limited set of types, creating such small, focused functions will help you fully utilize its potential.
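If you want to reproduce the comparison above yourself, the original interpreted function remains available through the jitted function's .py_func attribute (a small sketch; the exact timings will differ per machine):

# compiled version (compilation was already triggered by the earlier call)
%timeit monte_carlo_pi(10000)

# original, uncompiled Python function kept by Numba on the dispatcher
%timeit monte_carlo_pi.py_func(10000)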

On GPU

from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    idx = cuda.grid(1)          # absolute index of this thread in the 1-D grid
    out[idx] = x[idx] + y[idx]  # each thread adds one pair of elements

n = 4096
x = np.arange(n).astype(np.int32)
y = np.ones_like(x)
d_x = cuda.to_device(x)              # copy inputs to GPU memory
d_y = cuda.to_device(y)
d_out = cuda.device_array_like(d_x)  # allocate output on the GPU

threads_per_block = 128
blocks_per_grid = 32

add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)
cuda.synchronize()  # wait for the GPU to finish before timing/reading results

This addition kernel on CUDA takes 100 μs (mean of 7 runs, 10,000 loops each), while a naive addition loop for the same input takes 880 μs (mean of 7 runs, 1,000 loops each). The same operation on CUDA is 8.8 times faster!

Note that the time measured for the CUDA kernel covers only the calculation itself; it does not include the time taken to move the data from host memory to GPU memory and to copy the output back from the GPU to the host.
Still, I think it is worth using for functions that perform heavy calculations with NumPy or basic math operations. If such a function is called every second, say while running inference with a model, it can make a significant improvement to overall performance!

Conclusion

It was fun exploring the basic concepts of GPUs and CUDA. Also, after getting exposed to Numba, it is clear that we can accelerate code on the CPU as well. When running inference on a CPU, Numba might help accelerate some part of the pipeline!

Follow the Kaggle notebook shared above to explore Numba :).
Thanks for reading.

Follow me for more such blogs! Also connect with me on LinkedIn.

References

  1. https://numba.readthedocs.io/en/stable/user/5minguide.html
  2. https://www.techopedia.com/definition/3978/just-in-time-compiler-jit-compiler#:~:text=A%20just%2Din%2Dtime%20(,fly%20as%20the%20program%20executes.
  3. https://www.kaggle.com/harshwalia/1-introduction-to-cuda-python-with-numba
  4. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
  5. https://www.youtube.com/playlist?list=PLKK11Ligqititws0ZOoGk3SW-TZCar4dK
  6. https://blogs.nvidia.com/blog/2012/09/10/what-is-cuda-2/
  7. https://deeplizard.com/learn/video/6stDhEA0wFQ
  8. https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
  9. http://thebeardsage.com/cuda-streaming-multiprocessors/
  10. https://stackoverflow.com/questions/32226993/understanding-streaming-multiprocessors-sm-and-streaming-processors-sp
  11. https://stackoverflow.com/questions/2207171/help-me-understand-cuda/2213744#2213744
  12. SMs — https://stackoverflow.com/questions/3519598/streaming-multiprocessors-blocks-and-threads-cuda
  13. https://stackoverflow.com/questions/10460742/how-do-cuda-blocks-warps-threads-map-onto-cuda-cores
  14. https://cvw.cac.cornell.edu/gpu/coalesced
  15. Half-warp — https://forums.developer.nvidia.com/t/why-only-half-warp/15915/6
  16. Bank conflict — https://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming#:~:text=The%20shared%20memory%20that%20can,the%20advantages%20of%20parallel%20access.
  17. Bank conflict — https://www.youtube.com/watch?v=CZgM3DEBplE
  18. Bank conflict — https://www.generacodice.com/en/articolo/666174/%C2%BFqu%C3%A9-es-un-conflicto-banco?-(hacer-la-programaci%C3%B3n-cuda-/-opencl)

