A simple, beginner-friendly project demonstrating the difference between CPU and GPU computation using vector addition as an example.
Perfect for learning CUDA programming and building your GitHub portfolio! 🚀
This project compares two ways of adding vectors (arrays of numbers):
The CPU (Central Processing Unit):
- The CPU is the "brain" of your computer
- It processes tasks one at a time (sequentially)
- Good for complex logic and decision-making
- Slower for repetitive, parallel tasks

The GPU (Graphics Processing Unit):
- The GPU was originally designed for graphics (gaming, video)
- It has thousands of small cores that can work simultaneously
- Processes many tasks at the same time (in parallel)
- MUCH faster for repetitive tasks like vector addition
Why GPUs are Faster for Parallel Tasks: Imagine you need to paint 1000 identical fences:
- CPU approach: One painter paints all 1000 fences (slow)
- GPU approach: 1000 painters each paint 1 fence at the same time (fast!)
For vector addition, the GPU can compute C[0], C[1], C[2], ... all at once!
This project teaches fundamental CUDA concepts:
| Concept | What It Means |
|---|---|
| CPU vs GPU | Sequential processing vs parallel processing |
| CUDA Kernel | A function that runs on the GPU (__global__) |
| Thread Indexing | Each GPU thread has a unique ID (threadIdx.x, blockIdx.x) |
| GPU Memory Allocation | Reserving memory on the GPU (cudaMalloc) |
| Host ↔ Device Copy | Transferring data between CPU and GPU (cudaMemcpy) |
| Parallel Execution | Thousands of threads running simultaneously |
Vector addition is the "Hello World" of CUDA programming!
It's the simplest example that demonstrates:
- ✅ How to write a CUDA kernel
- ✅ How to manage GPU memory
- ✅ How parallel execution works
- ✅ The performance difference between CPU and GPU
Once you understand this, you can move on to more complex GPU applications like:
- Machine learning and AI
- Image processing
- Scientific simulations
- Cryptocurrency mining
CUDA-Vector-Addition-Beginner/
├── cpu/
│ └── vector_add_cpu.cpp # CPU version (standard C++)
├── cuda/
│ └── vector_add_gpu.cu # GPU version (CUDA C++)
└── README.md # This file
cpu/vector_add_cpu.cpp
- Written in standard C++
- Uses a simple for-loop to add vectors
- Runs entirely on the CPU (sequential)
- Compiled with `g++` (the standard C++ compiler)
cuda/vector_add_gpu.cu
- Written in CUDA C++ (`.cu` extension)
- Uses a CUDA kernel to add vectors in parallel
- Runs on the GPU with thousands of threads
- Compiled with `nvcc` (NVIDIA CUDA Compiler)
Before running this project, you need:
- For CPU version:
  - A C++ compiler like `g++` (usually pre-installed on Linux/Mac; use MinGW on Windows)
- For GPU version:
  - An NVIDIA GPU (any CUDA-capable GPU)
  - CUDA Toolkit installed (Download here)
  - The `nvcc` compiler (comes with the CUDA Toolkit)
The CPU version uses standard C++ and runs on any computer.
Compile:
```bash
g++ cpu/vector_add_cpu.cpp -o vector_add_cpu
```
Run:
```bash
./vector_add_cpu
```
On Windows:
```bash
vector_add_cpu.exe
```
What it does:
- Initializes two vectors A and B
- Adds them sequentially using a for-loop
- Prints the results and execution time
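For reference, here is a minimal sketch of what such a sequential program can look like. This is illustrative only: the actual `cpu/vector_add_cpu.cpp` in this repo may differ, and the initialization `A[i] = i`, `B[i] = 2*i` is an assumption chosen to match the sample output shown later.

```cpp
// Minimal sketch of sequential vector addition (illustrative, not the repo file)
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<float> A(N), B(N), C(N);

    // Assumed initialization: A[i] = i, B[i] = 2*i
    for (int i = 0; i < N; i++) {
        A[i] = static_cast<float>(i);
        B[i] = 2.0f * static_cast<float>(i);
    }

    // Add the vectors one element at a time and measure how long it takes
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "C[1] = " << C[1] << "\n";
    std::cout << "CPU execution time: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms\n";
    return 0;
}
```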
The GPU version uses CUDA and requires an NVIDIA GPU and CUDA Toolkit.
Compile:
```bash
nvcc cuda/vector_add_gpu.cu -o vector_add_gpu
```
Run:
```bash
./vector_add_gpu
```
On Windows:
```bash
vector_add_gpu.exe
```
What it does:
- Initializes two vectors A and B on the CPU
- Copies them to the GPU
- Launches a CUDA kernel with thousands of threads
- Each thread adds ONE pair of elements in parallel
- Copies the result back to the CPU
- Prints the results and execution time
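And here is a minimal sketch of the GPU counterpart, so you can see all of those steps in one place. Again this is illustrative: the actual `cuda/vector_add_gpu.cu` may differ (for example in timing, output, and error checking), and the same initialization as above is assumed.

```cuda
// Minimal sketch of GPU vector addition (illustrative, not the repo file)
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds ONE pair of elements
__global__ void vectorAddKernel(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    const int N = 1000000;
    const size_t size = N * sizeof(float);

    // Host (CPU) arrays
    float *h_A = new float[N], *h_B = new float[N], *h_C = new float[N];
    for (int i = 0; i < N; i++) {
        h_A[i] = static_cast<float>(i);
        h_B[i] = 2.0f * static_cast<float>(i);
    }

    // Device (GPU) arrays
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy the inputs to the GPU
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough threads to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back to the CPU
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[1] = %f\n", h_C[1]);

    // Free GPU and CPU memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    delete[] h_A; delete[] h_B; delete[] h_C;
    return 0;
}
```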
1. The `__global__` keyword
```cuda
__global__ void vectorAddKernel(float* A, float* B, float* C, int N)
```
- Marks a function as a CUDA kernel
- Runs on the GPU but can be called from the CPU
- Executed by many threads in parallel
2. Thread Indexing (`threadIdx.x`, `blockIdx.x`)
```cuda
int i = blockIdx.x * blockDim.x + threadIdx.x;
```
- Each thread has a unique ID
- This ID determines which element the thread processes
- Thread 0 handles C[0], Thread 1 handles C[1], etc.
3. `cudaMalloc` (Allocate GPU memory)
```cuda
cudaMalloc((void**)&d_A, size);
```
- Like `malloc()`, but for GPU memory
- Reserves space on the GPU for data
4. `cudaMemcpy` (Copy data between CPU and GPU)
```cuda
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); // CPU → GPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // GPU → CPU
```
- Transfers data between the CPU (host) and the GPU (device)
5. Kernel Launch
```cuda
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```
- The `<<<blocks, threads>>>` syntax launches the kernel
- Creates thousands of threads to run in parallel
If you're new to CUDA, follow these steps:
- ✅ Start with the CPU version – Understand basic vector addition
- ✅ Read the CUDA code comments – They explain every line
- ✅ Run both versions – Compare the execution times
- ✅ Experiment – Try changing the vector size (N)
- ✅ Modify the code – Try vector subtraction or multiplication
Next Steps:
- Learn about shared memory optimization
- Try 2D thread blocks for matrix operations
- Explore CUDA libraries like cuBLAS and cuDNN
```text
=== CPU Vector Addition ===
Vector size: 1000000 elements
Initializing vectors...
Performing vector addition on CPU...
Results (first 10 elements):
C[0] = A[0] + B[0] = 0 + 0 = 0
C[1] = A[1] + B[1] = 1 + 2 = 3
C[2] = A[2] + B[2] = 2 + 4 = 6
...
CPU execution time: 2.5 ms
```
```text
=== GPU Vector Addition using CUDA ===
Vector size: 1000000 elements
Initializing vectors on CPU...
Allocating memory on GPU...
Copying data from CPU to GPU...
Launching kernel with 3907 blocks and 256 threads per block...
Total threads: 1000192
Copying result from GPU to CPU...
Results (first 10 elements):
C[0] = A[0] + B[0] = 0 + 0 = 0
C[1] = A[1] + B[1] = 1 + 2 = 3
C[2] = A[2] + B[2] = 2 + 4 = 6
...
GPU execution time: 0.5 ms
```
Note: GPU execution time includes memory transfer overhead. For very large arrays, the GPU speedup becomes much more significant!
- `g++` is for CPU programs (standard C++)
  - Available on all platforms
  - No special hardware required
- `nvcc` is for CUDA programs (GPU programs)
  - Comes with the NVIDIA CUDA Toolkit
  - Requires an NVIDIA GPU
  - The `.cu` file extension indicates CUDA code
For CPU version:
- Any computer with a C++ compiler
For GPU version:
- NVIDIA GPU (CUDA-capable)
- Check compatibility: CUDA GPUs
- CUDA Toolkit installed
- Download: CUDA Toolkit
- Operating System: Windows, Linux, or macOS (with NVIDIA GPU)
"nvcc: command not found"
- CUDA Toolkit is not installed or not in your PATH
- Install CUDA Toolkit and add it to your system PATH
"no CUDA-capable device detected"
- You don't have an NVIDIA GPU
- Your GPU drivers are not installed
- Your GPU doesn't support CUDA
Slow GPU performance
- Normal for small arrays (memory transfer overhead)
- Try increasing N to 10,000,000 to see real speedup
This project is designed for:
- ✅ Absolute beginners in CUDA programming
- ✅ Students learning parallel computing
- ✅ Developers building a GPU programming portfolio
- ✅ Anyone curious about GPU acceleration
No prior CUDA experience required! Just basic C/C++ knowledge.
This project follows best practices for educational code:
- ✨ Very simple – No advanced features
- 💬 Well commented – Every line explained
- 📖 Clear naming – Variables like `h_A` (host) and `d_A` (device)
- 🎯 Focused – One concept at a time
- 🧹 Clean – Proper memory management
Want to extend this project? Try:
- Add error checking for CUDA API calls (see the sketch after this list)
- Implement vector subtraction, multiplication, or dot product
- Compare performance with different vector sizes
- Add a benchmarking script
- Create a 2D matrix addition version
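For the error-checking idea, a common pattern is a small macro that wraps every CUDA API call. This is only a sketch; the name `CUDA_CHECK` is our own and not part of the CUDA API, but `cudaGetErrorString` is a standard runtime function.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: wrap a CUDA API call and abort with a
// readable message (file, line, error string) if it fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage inside main():
//   CUDA_CHECK(cudaMalloc((void**)&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
```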
This project is open source and free to use for learning and portfolio purposes.
Found a bug or have a suggestion? Feel free to open an issue or submit a pull request!
If this project helped you learn CUDA, give it a star on GitHub! ⭐
Happy CUDA Programming! 🎉
Remember: Even the most complex GPU applications start with simple concepts like vector addition. Master this, and you're on your way to building amazing GPU-accelerated software!
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It allows you to use your graphics card (GPU) not just for gaming, but also for:
- 🎬 Video editing (applying filters, effects, rendering)
- 🤖 Artificial Intelligence (training neural networks, deep learning)
- 🔬 Scientific simulations (physics, chemistry, weather modeling)
- 💹 Data processing (financial analysis, big data)
- 🎨 Image processing (photo editing, computer vision)
Why use the GPU? Your GPU has thousands of small cores that can work in parallel, making it incredibly fast for repetitive tasks. Instead of one CPU core doing 1000 tasks sequentially, 1000 GPU threads can do them simultaneously!
CUDA is both:
- A programming model: How to write GPU-parallel code
- A toolkit: Software that lets your computer understand and run that code on the GPU
Think of it like this:
- Your CPU is like a skilled manager who handles complex decisions
- Your GPU is like a huge team of workers who can all do simple tasks at the same time
CUDA includes several components that work together:
| Component | What It Does |
|---|---|
| Driver | Low-level software that lets your computer talk to the GPU |
| Toolkit | Collection of tools including compiler, debugger, libraries |
| nvcc Compiler | Special compiler that turns CUDA C++ code (.cu files) into GPU machine code |
| Libraries | Pre-built functions for common GPU tasks (math, deep learning, etc.) |
| Numba (Python) | Tool to write GPU code in Python without learning C++ |
Two ways to use CUDA:
- C++ + nvcc: Full control and maximum performance (what this project uses)
- Python + Numba: Easy GPU programming without learning C++ (uses the `@cuda.jit` decorator)
When you run a CUDA program, here's what happens step by step:
1. CPU starts the program
↓
2. CPU allocates memory on both CPU (host) and GPU (device)
↓
3. CPU sends data to the GPU (e.g., vectors A and B)
↓
4. CPU launches the kernel (tells GPU: "Run this function!")
↓
5. GPU executes the kernel with thousands of threads in parallel
↓
6. GPU sends results back to CPU (e.g., vector C)
↓
7. Program frees all memory and terminates
Real-world example: Editing a 4K video frame
- CPU loads the video file
- CPU sends one frame to GPU
- GPU applies a filter (sharpen, color correction) super fast
- GPU sends the filtered frame back
- Repeat for all frames
What is a thread? A thread is a single instance of your kernel function running on the GPU. When you launch a kernel, thousands of threads run the same code but on different data.
Thread Hierarchy: CUDA organizes threads in a 3-dimensional structure:
Grid (entire computation)
└── Blocks (groups of threads)
└── Threads (individual workers)
Why 3D? This is convenient for naturally multi-dimensional problems:
- 1D: Processing an array (like our vector addition)
- 2D: Processing an image (width × height)
- 3D: Processing video (width × height × frames)
Every CUDA thread has access to special variables that tell it "who it is":
| Variable | What It Means | Range |
|---|---|---|
| `threadIdx.x` | Thread ID within its block | 0 to (blockDim.x - 1) |
| `blockIdx.x` | Block ID in the grid | 0 to (gridDim.x - 1) |
| `blockDim.x` | Total threads per block | Set by the programmer |

These also exist for the `.y` and `.z` dimensions!
The most important calculation in CUDA is finding each thread's global ID:
```cuda
int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;
```
What this means:
- `blockIdx.x`: Which block am I in?
- `blockDim.x`: How many threads per block?
- `threadIdx.x`: Which thread am I within my block?
- Result: My unique position in the entire computation
Bakery analogy: Imagine a bakery with 4 ovens (blocks), each with 8 trays (threads):
- To find tray #26 in the whole bakery:
- It's in oven #3 (blockIdx.x = 3)
- It's tray #2 inside that oven (threadIdx.x = 2)
- Formula: `3 × 8 + 2 = 26` ✅
Theater seat analogy: You're in seat #26 at a theater with 4 rows (blocks) of 8 seats (threads):
- Row 3 (blockIdx.x = 3)
- Seat 2 in that row (threadIdx.x = 2)
- Global seat: `3 × 8 + 2 = 26` ✅
Let's say we have 4 blocks, each with 8 threads (total = 32 threads):
```text
Global ID:   [0][1][2][3][4][5][6][7] [8][9][10]...[26]...[31]
             └─── Block 0 ───┘ └─── Block 1 ───┘ ... └Block 3┘
threadIdx.x: [0][1][2][3][4][5][6][7] [0][1][2]...[2]...[7]
                                                   ↑
                                               Thread 26
blockIdx.x:  [ Block 0 ] [ Block 1 ] ... [ Block 3 ]
                                              ↑
                                     Thread 26 is here
```
For thread with global ID 26:
- `blockIdx.x = 3` (it's in block 3)
- `threadIdx.x = 2` (it's thread #2 within that block)
- `blockDim.x = 8` (each block has 8 threads)
- Calculation: `3 × 8 + 2 = 26` ✅
This means: Thread 26 is responsible for processing array[26] in our vector!
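If you want to see these IDs for yourself, a tiny throwaway program (our own example, not part of this project) can print them from the GPU using the device-side `printf`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its block ID, local thread ID, and computed global ID.
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global ID %d\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main() {
    whoAmI<<<4, 8>>>();       // 4 blocks of 8 threads = 32 threads (like the example above)
    cudaDeviceSynchronize();  // wait so the device printf output actually appears
    return 0;
}
```

Notice that the lines will not necessarily come out in order, which is a nice reminder that blocks and threads run in whatever order the GPU chooses.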
The GPU automatically handles:
- ✅ Thread scheduling: Assigning threads to physical cores
- ✅ Warp execution: Groups of 32 threads run together
- ✅ Block scheduling: Distributing blocks across streaming multiprocessors
- ✅ Core mapping: Deciding which thread runs on which core
You control:
- 🎯 How many threads and blocks to launch
- 🎯 What each thread should compute
- 🎯 How threads access memory
In our `vector_add_gpu.cu` code:
```cuda
// We launch the kernel with 256 threads per block
int threadsPerBlock = 256;
int blocksPerGrid = (N + 255) / 256;
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```
Inside the kernel:
```cuda
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
    C[i] = A[i] + B[i];  // Each thread adds ONE element
}
```
For N = 1,000,000 elements:
- `threadsPerBlock = 256`
- `blocksPerGrid = (1,000,000 + 255) / 256 = 3,907` blocks
- Total threads launched: 3,907 × 256 = 1,000,192 threads
- Each thread computes one element (threads beyond N do nothing due to the `if` check)
This means:
- Thread 0 computes `C[0] = A[0] + B[0]`
- Thread 1 computes `C[1] = A[1] + B[1]`
- Thread 26 computes `C[26] = A[26] + B[26]`
- ...all happening at the same time! ⚡
CPU approach (sequential):
Time = N iterations (one at a time)
For 1,000,000 elements = 1,000,000 time steps
GPU approach (parallel):
Time = (N / number_of_cores) iterations
For 1,000,000 elements with 1000 cores = 1,000 time steps
Speedup = 1000× faster! 🚀
- ✅ CUDA lets you use your GPU for any computation, not just graphics
- ✅ Threads run in parallel on thousands of GPU cores
- ✅ Each thread has a unique ID calculated by `blockIdx.x * blockDim.x + threadIdx.x`
- ✅ You write one kernel function, but it runs thousands of times simultaneously
- ✅ The GPU handles all the scheduling automatically – you just define the logic
- ✅ Memory management is explicit – you control what goes to GPU and when
How CUDA organizes threads:
CUDA doesn't just run thousands of independent threads. It structures threads into blocks, and blocks into grids.
Grid (1D, 2D, or 3D structure)
└── Thread Blocks (groups of threads)
└── Threads (individual workers)
Key properties:
| Property | Description |
|---|---|
| Thread blocks are independent | Each block can run in any order |
| Threads in a block run on the same SM | All threads in one block execute on one Streaming Multiprocessor (GPU core) |
| Threads in a block can communicate | They share memory and can synchronize |
| Blocks build a grid | Multiple blocks form the complete computation |
Theater stage analogy: Think of a theater with multiple stages (SMs):
- Actors = blocks
- Stages = GPU cores (SMs)
- Actors are assigned to stages when they're free
- More space = more actors performing at once
- The order doesn't matter!
For problems that are naturally multi-dimensional (like images or 3D models), CUDA provides 2D and 3D indexing.
CUDA provides the `dim3` structure:
```cuda
typedef struct {
    int x; int y; int z;
} dim3;
```
Example: Processing a 512×512 image
```cuda
// Grid configuration: 16×16 blocks
dim3 gridDim(16, 16, 1);   // 16×16 = 256 blocks total

// Block configuration: 32×32 threads per block
dim3 blockDim(32, 32, 1);  // 32×32 = 1024 threads per block

// Calculate global 2D thread position
int th_x = blockIdx.x * blockDim.x + threadIdx.x;
int th_y = blockIdx.y * blockDim.y + threadIdx.y;

// Now th_x and th_y represent pixel coordinates in the image!
```
Why calculate global thread ID? Because each thread needs to know which part of data it should work on!
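Putting the 2D indexing together, here is a sketch of a kernel that could brighten every pixel of a grayscale image. It is our own illustrative example, not a file in this repo; `brightenKernel` and `d_image` are assumed names.

```cuda
// Each thread handles one pixel (x, y) of a width×height image.
__global__ void brightenKernel(float* image, int width, int height, float amount) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int idx = y * width + x;   // 2D coordinates -> 1D array index
        image[idx] += amount;      // brighten this pixel
    }
}

// Host-side launch (assumes d_image was allocated with cudaMalloc and
// filled with cudaMemcpy, just like d_A in the vector example):
//   dim3 threads(32, 32, 1);
//   dim3 blocks((width + 31) / 32, (height + 31) / 32, 1);
//   brightenKernel<<<blocks, threads>>>(d_image, width, height, 0.1f);
```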
Built-in CUDA variables:
| Variable | What It Means |
|---|---|
| `threadIdx` | Thread ID within its block (x, y, z) |
| `blockIdx` | Block ID within the grid (x, y, z) |
| `blockDim` | Size (number of threads) of a block |
| `gridDim` | Size (number of blocks) of the grid |
Dynamic scheduling:
When you launch a CUDA kernel, you don't control exactly when or where each block runs. The GPU hardware does this automatically!
The process:
- You define: Kernel function, number of blocks, threads per block
- CUDA assigns: Blocks to SMs (Streaming Multiprocessors)
- CUDA assigns: Threads to CUDA cores
- No guarantee of order: Blocks can execute in any order
- Dynamic allocation: If an SM has more resources, it will run more blocks
Key insight: You don't assign threads to cores manually — CUDA does it automatically! This is what makes CUDA so powerful and easy to use.
What is a warp?
A warp is a group of 32 consecutive threads that execute together. This is a fundamental hardware concept in CUDA.
How warps work:
Block of 256 threads
├── Warp 0: Threads 0-31
├── Warp 1: Threads 32-63
├── Warp 2: Threads 64-95
...
└── Warp 7: Threads 224-255
Warp properties:
- ✅ All threads in a warp belong to the same block
- ✅ Threads are placed in warps sequentially
- ✅ All threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads)
- ✅ Within an SM, each warp scheduler issues instructions from one warp at a time
- ✅ The GPU implements zero-overhead warp scheduling (switching between warps is free!)
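As a small illustration (our own snippet, not from this repo), each thread can work out which warp and lane it belongs to using the built-in `warpSize` constant:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes which warp (within its block) and which lane it is.
__global__ void warpInfo() {
    int warpInBlock = threadIdx.x / warpSize;  // warp number inside this block
    int lane = threadIdx.x % warpSize;         // position inside the warp (0-31)

    // Print only the first thread of each warp to keep the output short
    if (lane == 0) {
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpInBlock, threadIdx.x);
    }
}

int main() {
    warpInfo<<<2, 256>>>();   // 2 blocks of 256 threads = 8 warps per block
    cudaDeviceSynchronize();  // wait for the device printf output
    return 0;
}
```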
What is a Warp Scheduler?
The warp scheduler is hardware inside each SM that decides which warp should run next.
How it works:
- After a block is assigned to an SM, the SM splits it into warps
- The warp scheduler picks which warp to run based on:
- Data availability
- Instruction readiness
- Priority policy
- Eligible warps (whose operands are ready) are selected for execution
- This happens automatically — you don't control it!
Student class analogy: Think of a warp like 32 students in class all solving the same math problem at the same time. They all follow the same steps, but each works on their own numbers. The teacher (warp scheduler) decides which group of 32 students to help next.
Let's see the difference between CPU (serial) and GPU (parallel) with vector addition:
CPU Approach (Serial):
```cpp
// CPU code: Sequential execution
void vec_add_cpu(int size, float* a, float* b, float* result) {
    for (int i = 0; i < size; i++) {
        result[i] = a[i] + b[i];  // One at a time
    }
}
```
Timeline:
Time 0: result[0] = a[0] + b[0]
Time 1: result[1] = a[1] + b[1]
Time 2: result[2] = a[2] + b[2]
...
Time N-1: result[N-1] = a[N-1] + b[N-1]
GPU Approach (Parallel):
```cuda
// GPU kernel: Parallel execution
__global__ void vec_add_gpu(int size, float* a, float* b, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        result[i] = a[i] + b[i];  // All at once!
    }
}
```
Timeline:
Time 0: ALL results computed simultaneously!
Thread 0: result[0] = a[0] + b[0]
Thread 1: result[1] = a[1] + b[1]
Thread 2: result[2] = a[2] + b[2]
...
Thread N-1: result[N-1] = a[N-1] + b[N-1]
The difference:
- CPU: N iterations (sequential) = N time steps
- GPU: 1 iteration (parallel) = 1 time step (ignoring hardware limits)
Students analogy: Imagine 6 students solving 6 pairs of math problems:
- CPU: One student does all 6 problems (slow)
- GPU: Each student does 1 problem simultaneously (fast!)
A kernel is a GPU function that:
- ✅ Runs on the GPU
- ✅ Is called from CPU code
- ✅ Executes with thousands of threads in parallel
Key properties:
| Property | Description |
|---|---|
| No return value | Kernels can't return values directly |
| Output via arrays | Results must be written to arrays passed as parameters |
| Declare thread hierarchy | You specify blocks and threads when calling |
| Asynchronous execution | CPU continues immediately without waiting for GPU |
Kernel definition (C++):
```cuda
__global__ void myKernel(float* input, float* output, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        output[i] = input[i] * 2.0f;  // Example operation
    }
}
```
Kernel invocation (launch):
```cuda
int threadsPerBlock = 256;
int blocksPerGrid = (N + 255) / 256;

// Launch the kernel
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, N);
```
What happens when you launch a kernel:
- CPU sends command to GPU: "Run this kernel"
- GPU creates blocks and threads according to your specification
- CPU continues immediately (asynchronous!)
- GPU executes kernel in parallel
- CPU can check later if GPU finished (synchronization)
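A sketch of what step 5 looks like in host code, continuing the `myKernel` example above (`h_output` and `size` are assumed to have been set up like in the earlier memory snippets):

```cuda
// The launch returns immediately; the CPU is free to do other work here.
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, N);

// ... CPU work that does not depend on the GPU result ...

// Block the CPU until the GPU has finished all queued work.
cudaDeviceSynchronize();

// Now it is safe to copy the result back.
// (A cudaMemcpy on the default stream also waits for the kernel to finish,
//  which is why the simple examples in this project work without an
//  explicit synchronize.)
cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);
```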
Real-world example: You want to apply a filter to 1,000 photos:
- Write a kernel that processes one photo
- Launch kernel: "Use 100 blocks, 10 threads each"
- GPU applies filter to all 1,000 photos at the same time!
CUDA provides different types of memory, each with different speed, size, and scope.
Memory types and their properties:
| Memory Type | Location | Scope | Lifetime | Speed | Size |
|---|---|---|---|---|---|
| Register | On-chip (SM) | Single thread | Thread | ⚡⚡⚡ Fastest | Very small |
| Local | Off-chip (DRAM) | Single thread | Thread | 🐌 Slow | Medium |
| Shared | On-chip (SM) | All threads in block | Block | ⚡⚡ Very fast | Small (~48KB) |
| Global | Off-chip (DRAM) | All threads in grid | Application | 🐌 Slowest | Large (GBs) |
| Constant | Off-chip (cached) | All threads in grid | Application | ⚡ Fast (cached) | Small (~64KB) |
Variable type qualifiers in CUDA:
```cuda
__global__ void myKernel() {
    // Automatic variables → Registers (fastest!)
    int threadLocal = threadIdx.x;

    // Shared memory → Fast, shared within block
    __shared__ float sharedData[256];

    // Device memory → Slow, but large
    // Passed as pointer from host
}

// Global memory → Accessible by all
__device__ float globalVar;

// Constant memory → Read-only, cached
__constant__ float constVar;
```
Memory hierarchy visualization:
Thread
└── Registers (private, fastest)
└── Local Memory (private, slow - overflow from registers)
Block
└── Shared Memory (shared within block, very fast)
Grid
└── Global Memory (shared everywhere, large but slow)
└── Constant Memory (read-only, cached, fast)
Performance tips:
- ✅ Use registers for thread-local variables (automatic)
- ✅ Use shared memory for data shared within a block (manual)
- ⚠️ Minimize global memory access (it's slow!)
- ✅ Use constant memory for read-only data (automatically cached)
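To make the shared-memory tip concrete, here is a small sketch (our own example, assuming blocks of 256 threads, not code from this repo) where each block first stages its slice of the input in shared memory, then every thread averages its element with its left neighbour:

```cuda
// Each block copies its chunk of `in` into fast shared memory once,
// then threads read from shared memory instead of going back to
// slow global memory for the neighbour value.
__global__ void neighborAverage(const float* in, float* out, int N) {
    __shared__ float tile[256];        // one slot per thread (assumes blockDim.x == 256)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        tile[threadIdx.x] = in[i];     // one global read per thread
    }
    __syncthreads();                   // wait until the whole tile is loaded

    if (i < N) {
        // At the block edge we simply reuse our own value to keep the sketch simple.
        float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        out[i] = 0.5f * (left + tile[threadIdx.x]);
    }
}
```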
It's crucial to understand that grid, block, thread are software concepts, while GPU, SM, CUDA core are hardware components.
Software Layer (Your Code):
| Concept | What You Define |
|---|---|
| Grid | Collection of blocks you launch |
| Block | Group of threads (e.g., 256 threads) |
| Thread | Single instance of kernel function |
Hardware Layer (Physical GPU):
| Component | Physical Hardware |
|---|---|
| GPU | The entire graphics card |
| SM (Streaming Multiprocessor) | A GPU core that executes blocks |
| CUDA Core | Tiny compute unit that runs one thread at a time |
Are they equivalent? NO!
- ❌ A grid ≠ a GPU
- ❌ A block ≠ an SM
- ❌ A thread ≠ a CUDA core
How they cooperate:
- You write code defining grids, blocks, and threads
- GPU hardware assigns blocks to SMs dynamically
- Each SM breaks blocks into warps (32 threads each)
- Warp scheduler assigns warps to CUDA cores
- CUDA cores execute individual threads
Example with real numbers:
Your code:
- Grid: 1000 blocks
- Block size: 256 threads
Your hardware (e.g., RTX 3080):
- GPU: 1 device
- SMs: 68 streaming multiprocessors
- CUDA cores per SM: 128
What happens:
- GPU assigns multiple blocks to each SM
- Each SM runs blocks one (or more) at a time
- SM splits each block into 8 warps (256/32)
- Warp scheduler runs warps on CUDA cores
- Result: Massive parallelism!
Key insight: The hardware automatically handles all the mapping and scheduling. You just define the logic and structure — CUDA does the rest!
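You can look up these hardware numbers for your own card with the standard `cudaGetDeviceProperties` runtime call; the printed values will of course depend on your GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

    printf("GPU name:            %s\n", prop.name);
    printf("SM count:            %d\n", prop.multiProcessorCount);
    printf("Warp size:           %d\n", prop.warpSize);
    printf("Max threads / block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```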
Let's trace what happens when you run our vector addition program:
1. CPU Code (Host):
```cuda
vectorAddKernel<<<3907, 256>>>(d_A, d_B, d_C, N);
```
2. CUDA Creates Structure:
- Grid: 3,907 blocks
- Block size: 256 threads per block
- Total threads: 3,907 × 256 = 1,000,192 threads
3. GPU Hardware Assignment:
- GPU has (for example) 68 SMs
- Each SM gets multiple blocks assigned
- Blocks can run on any available SM
4. Each SM Processes Its Blocks:
- SM splits each block into warps: 256 threads ÷ 32 = 8 warps
- Warp scheduler decides which warp runs next
- CUDA cores execute threads
5. Each Thread Computes:
```cuda
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
    C[i] = A[i] + B[i];
}
```
6. Memory Access:
- Each thread reads from global memory: `A[i]`, `B[i]`
- Each thread computes the sum
- Each thread writes to global memory: `C[i]`
7. Synchronization:
- All threads complete
- GPU signals CPU: "I'm done!"
- CPU copies result back from GPU
This entire process happens in milliseconds with massive parallelism! 🚀
Now that you understand these concepts, look at our `cuda/vector_add_gpu.cu` code again:
- Can you identify where we calculate the global thread ID?
- Can you see how each thread processes exactly one element?
- Can you understand why we need the `if (i < N)` check?
- Can you see how blocks are organized on the grid?
- Do you understand what happens to the thread blocks on the SMs?
Try experimenting:
- Change `threadsPerBlock` from 256 to 128 or 512 (notice how it affects warps!)
- Modify the kernel to do vector subtraction instead
- Add timing code to measure GPU speedup (see the sketch after this list)
- Try processing 2D data (like images) using 2D blocks
- Experiment with shared memory for block-level communication
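For the timing experiment, CUDA events are the usual tool. The sketch below is meant to be dropped into the GPU program's `main` around the kernel launch; the variable names follow the earlier examples and are assumed to already exist.

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                  // mark the start on the GPU timeline
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);                   // mark the end on the GPU timeline

cudaEventSynchronize(stop);              // wait until the stop event has actually happened

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Note that this measures only the kernel itself; if you want the "end-to-end" time including `cudaMemcpy`, record the events around those calls as well.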
The best way to learn CUDA is by experimenting! 💪