CUDA Vector Addition – Beginner Project

A simple, beginner-friendly project demonstrating the difference between CPU and GPU computation using vector addition as an example.

Perfect for learning CUDA programming and building your GitHub portfolio! 🚀


📚 What is This Project About?

This project compares two ways of adding vectors (arrays of numbers):

🖥️ CPU Computation (Central Processing Unit)

  • The CPU is the "brain" of your computer
  • It processes tasks one at a time (sequentially)
  • Good for complex logic and decision-making
  • Slower for repetitive, parallel tasks

🎮 GPU Computation (Graphics Processing Unit)

  • The GPU was originally designed for graphics (gaming, video)
  • It has thousands of small cores that can work simultaneously
  • Processes many tasks at the same time (in parallel)
  • MUCH faster for repetitive tasks like vector addition

Why GPUs are Faster for Parallel Tasks: Imagine you need to paint 1000 identical fences:

  • CPU approach: One painter paints all 1000 fences (slow)
  • GPU approach: 1000 painters each paint 1 fence at the same time (fast!)

For vector addition, the GPU can compute C[0], C[1], C[2], ... all at once!


🎯 Concepts Demonstrated

This project teaches fundamental CUDA concepts:

Concept | What It Means
CPU vs GPU | Sequential processing vs parallel processing
CUDA Kernel | A function that runs on the GPU (__global__)
Thread Indexing | Each GPU thread has a unique ID (threadIdx.x, blockIdx.x)
GPU Memory Allocation | Reserving memory on the GPU (cudaMalloc)
Host ↔ Device Copy | Transferring data between CPU and GPU (cudaMemcpy)
Parallel Execution | Thousands of threads running simultaneously

💡 Why This Project Matters

Vector addition is the "Hello World" of CUDA programming!

It's the simplest example that demonstrates:

  • ✅ How to write a CUDA kernel
  • ✅ How to manage GPU memory
  • ✅ How parallel execution works
  • ✅ The performance difference between CPU and GPU

Once you understand this, you can move on to more complex GPU applications like:

  • Machine learning and AI
  • Image processing
  • Scientific simulations
  • Cryptocurrency mining

📁 Project Structure

CUDA-Vector-Addition-Beginner/
├── cpu/
│   └── vector_add_cpu.cpp    # CPU version (standard C++)
├── cuda/
│   └── vector_add_gpu.cu     # GPU version (CUDA C++)
└── README.md                  # This file

File Descriptions

cpu/vector_add_cpu.cpp

  • Written in standard C++
  • Uses a simple for-loop to add vectors
  • Runs entirely on the CPU (sequential)
  • Compiled with g++ (the standard C++ compiler)

cuda/vector_add_gpu.cu

  • Written in CUDA C++ (.cu extension)
  • Uses a CUDA kernel to add vectors in parallel
  • Runs on the GPU with thousands of threads
  • Compiled with nvcc (NVIDIA CUDA Compiler)

🛠️ How to Compile and Run

Prerequisites

Before running this project, you need:

  1. For CPU version:

    • A C++ compiler like g++ (usually pre-installed on Linux/Mac, use MinGW on Windows)
  2. For GPU version:

    • An NVIDIA GPU (any CUDA-capable GPU)
    • CUDA Toolkit installed (download from https://developer.nvidia.com/cuda-downloads)
    • The nvcc compiler (comes with CUDA Toolkit)

CPU Version

The CPU version uses standard C++ and runs on any computer.

Compile:

g++ cpu/vector_add_cpu.cpp -o vector_add_cpu

Run:

./vector_add_cpu

On Windows:

vector_add_cpu.exe

What it does:

  • Initializes two vectors A and B
  • Adds them sequentially using a for-loop
  • Prints the results and execution time
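
For reference, below is a minimal sketch of what such a CPU program can look like; it is illustrative and not a copy of cpu/vector_add_cpu.cpp (the timing uses std::chrono):

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<float> A(N), B(N), C(N);
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0f * i; }   // initialize inputs

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) C[i] = A[i] + B[i];              // sequential addition
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "C[1] = " << C[1] << "\n";                      // expect 1 + 2 = 3
    std::cout << "CPU time: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms\n";
    return 0;
}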

GPU Version (CUDA)

The GPU version uses CUDA and requires an NVIDIA GPU and CUDA Toolkit.

Compile:

nvcc cuda/vector_add_gpu.cu -o vector_add_gpu

Run:

./vector_add_gpu

On Windows:

vector_add_gpu.exe

What it does:

  • Initializes two vectors A and B on the CPU
  • Copies them to the GPU
  • Launches a CUDA kernel with thousands of threads
  • Each thread adds ONE pair of elements in parallel
  • Copies the result back to the CPU
  • Prints the results and execution time
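
To tie these steps together, here is a hedged outline of the host-side code; it reuses vectorAddKernel and the h_/d_ naming from this README, but it is a sketch rather than the exact contents of cuda/vector_add_gpu.cu (error checking omitted):

size_t size = N * sizeof(float);
float *d_A, *d_B, *d_C;

cudaMalloc((void**)&d_A, size);                          // 1. allocate GPU memory
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);      // 2. copy inputs CPU → GPU
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);   // 3. launch kernel

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);      // 4. copy result GPU → CPU

cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);             // 5. free GPU memory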

📖 Understanding the CUDA Code

Key CUDA Concepts

1. __global__ keyword

__global__ void vectorAddKernel(float* A, float* B, float* C, int N)
  • Marks a function as a CUDA kernel
  • Runs on the GPU but can be called from the CPU
  • Executed by many threads in parallel

2. Thread Indexing (threadIdx.x, blockIdx.x)

int i = blockIdx.x * blockDim.x + threadIdx.x;
  • Each thread has a unique ID
  • This ID determines which element the thread processes
  • Thread 0 handles C[0], Thread 1 handles C[1], etc.

3. cudaMalloc (Allocate GPU memory)

cudaMalloc((void**)&d_A, size);
  • Like malloc() but for GPU memory
  • Reserves space on the GPU for data

4. cudaMemcpy (Copy data between CPU and GPU)

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // CPU → GPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // GPU → CPU
  • Transfers data between CPU (host) and GPU (device)

5. Kernel Launch

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
  • <<<blocks, threads>>> syntax launches the kernel
  • Creates thousands of threads to run in parallel

🎓 Learning Path

If you're new to CUDA, follow these steps:

  1. Start with the CPU version – Understand basic vector addition
  2. Read the CUDA code comments – They explain every line
  3. Run both versions – Compare the execution times
  4. Experiment – Try changing the vector size (N)
  5. Modify the code – Try vector subtraction or multiplication

Next Steps:

  • Learn about shared memory optimization
  • Try 2D thread blocks for matrix operations
  • Explore CUDA libraries like cuBLAS and cuDNN

📊 Expected Output

CPU Version Output:

=== CPU Vector Addition ===
Vector size: 1000000 elements
Initializing vectors...
Performing vector addition on CPU...

Results (first 10 elements):
C[0] = A[0] + B[0] = 0 + 0 = 0
C[1] = A[1] + B[1] = 1 + 2 = 3
C[2] = A[2] + B[2] = 2 + 4 = 6
...

CPU execution time: 2.5 ms

GPU Version Output:

=== GPU Vector Addition using CUDA ===
Vector size: 1000000 elements
Initializing vectors on CPU...
Allocating memory on GPU...
Copying data from CPU to GPU...
Launching kernel with 3907 blocks and 256 threads per block...
Total threads: 1000192
Copying result from GPU to CPU...

Results (first 10 elements):
C[0] = A[0] + B[0] = 0 + 0 = 0
C[1] = A[1] + B[1] = 1 + 2 = 3
C[2] = A[2] + B[2] = 2 + 4 = 6
...

GPU execution time: 0.5 ms

Note: GPU execution time includes memory transfer overhead. For very large arrays, the GPU speedup becomes much more significant!


⚠️ Important Notes

Compilation

  • g++ is for CPU programs (standard C++)

    • Available on all platforms
    • No special hardware required
  • nvcc is for CUDA programs (GPU programs)

    • Comes with NVIDIA CUDA Toolkit
    • Requires an NVIDIA GPU
    • .cu file extension indicates CUDA code

System Requirements

For CPU version:

  • Any computer with a C++ compiler

For GPU version:

  • NVIDIA GPU (CUDA-capable)
  • CUDA Toolkit installed
  • Operating System: Windows or Linux (recent CUDA Toolkit releases no longer support macOS)

Troubleshooting

"nvcc: command not found"

  • CUDA Toolkit is not installed or not in your PATH
  • Install CUDA Toolkit and add it to your system PATH

"no CUDA-capable device detected"

  • You don't have an NVIDIA GPU
  • Your GPU drivers are not installed
  • Your GPU doesn't support CUDA

Slow GPU performance

  • Normal for small arrays (memory transfer overhead)
  • Try increasing N to 10,000,000 to see real speedup

🎯 Target Audience

This project is designed for:

  • ✅ Absolute beginners in CUDA programming
  • ✅ Students learning parallel computing
  • ✅ Developers building a GPU programming portfolio
  • ✅ Anyone curious about GPU acceleration

No prior CUDA experience required! Just basic C/C++ knowledge.


📝 Code Style

This project follows best practices for educational code:

  • Very simple – No advanced features
  • 💬 Well commented – Every line explained
  • 📖 Clear naming – Variables like h_A (host) and d_A (device)
  • 🎯 Focused – One concept at a time
  • 🧹 Clean – Proper memory management

🚀 Future Enhancements

Want to extend this project? Try:

  • Add error checking for CUDA API calls (see the sketch after this list)
  • Implement vector subtraction, multiplication, or dot product
  • Compare performance with different vector sizes
  • Add a benchmarking script
  • Create a 2D matrix addition version
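
For the error-checking idea mentioned above, a common pattern is a small macro that wraps every CUDA API call; this is a generic sketch, not code from this repository:

#include <cstdio>
#include <cstdlib>

// Print the CUDA error string and abort if a call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void**)&d_A, size));
// CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));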

📚 Resources for Learning More


📄 License

This project is open source and free to use for learning and portfolio purposes.


🤝 Contributing

Found a bug or have a suggestion? Feel free to open an issue or submit a pull request!


⭐ Show Your Support

If this project helped you learn CUDA, give it a star on GitHub! ⭐


Happy CUDA Programming! 🎉

Remember: Even the most complex GPU applications start with simple concepts like vector addition. Master this, and you're on your way to building amazing GPU-accelerated software!


🧠 Understanding CUDA Concepts in Depth

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It allows you to use your graphics card (GPU) not just for gaming, but also for:

  • 🎬 Video editing (applying filters, effects, rendering)
  • 🤖 Artificial Intelligence (training neural networks, deep learning)
  • 🔬 Scientific simulations (physics, chemistry, weather modeling)
  • 💹 Data processing (financial analysis, big data)
  • 🎨 Image processing (photo editing, computer vision)

Why use the GPU? Your GPU has thousands of small cores that can work in parallel, making it incredibly fast for repetitive tasks. Instead of one CPU core doing 1000 tasks sequentially, 1000 GPU threads can do them simultaneously!


How CUDA Works: The Big Picture

CUDA is both:

  • A programming model: How to write GPU-parallel code
  • A toolkit: Software that lets your computer understand and run that code on the GPU

Think of it like this:

  • Your CPU is like a skilled manager who handles complex decisions
  • Your GPU is like a huge team of workers who can all do simple tasks at the same time

CUDA Components

CUDA includes several components that work together:

Component | What It Does
Driver | Low-level software that lets your computer talk to the GPU
Toolkit | Collection of tools including the compiler, debugger, and libraries
nvcc Compiler | Special compiler that turns CUDA C++ code (.cu files) into GPU machine code
Libraries | Pre-built functions for common GPU tasks (math, deep learning, etc.)
Numba (Python) | Tool to write GPU code in Python without learning C++

Two ways to use CUDA:

  1. C++ + nvcc: Full control and maximum performance (what this project uses)
  2. Python + Numba: Easy GPU programming without learning C++ (uses the @cuda.jit decorator)

The CUDA Execution Process

When you run a CUDA program, here's what happens step by step:

1. CPU starts the program
   ↓
2. CPU allocates memory on both CPU (host) and GPU (device)
   ↓
3. CPU sends data to the GPU (e.g., vectors A and B)
   ↓
4. CPU launches the kernel (tells GPU: "Run this function!")
   ↓
5. GPU executes the kernel with thousands of threads in parallel
   ↓
6. GPU sends results back to CPU (e.g., vector C)
   ↓
7. Program frees all memory and terminates

Real-world example: Editing a 4K video frame

  • CPU loads the video file
  • CPU sends one frame to GPU
  • GPU applies a filter (sharpen, color correction) super fast
  • GPU sends the filtered frame back
  • Repeat for all frames

Understanding Threads in CUDA

What is a thread? A thread is a single instance of your kernel function running on the GPU. When you launch a kernel, thousands of threads run the same code but on different data.

Thread Hierarchy: CUDA organizes threads in a 3-dimensional structure:

Grid (entire computation)
  └── Blocks (groups of threads)
        └── Threads (individual workers)

Why 3D? This is convenient for naturally multi-dimensional problems:

  • 1D: Processing an array (like our vector addition)
  • 2D: Processing an image (width × height)
  • 3D: Processing video (width × height × frames)

CUDA Built-in Variables

Every CUDA thread has access to special variables that tell it "who it is":

Variable | What It Means | Range
threadIdx.x | Thread ID within its block | 0 to (blockDim.x - 1)
blockIdx.x | Block ID in the grid | 0 to (gridDim.x - 1)
blockDim.x | Number of threads per block | Set by the programmer

These also exist for .y and .z dimensions!


Computing the Global Thread ID

The most important calculation in CUDA is finding each thread's global ID:

int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;

What this means:

  • blockIdx.x: Which block am I in?
  • blockDim.x: How many threads per block?
  • threadIdx.x: Which thread am I within my block?
  • Result: My unique position in the entire computation

Bakery analogy: Imagine a bakery with 4 ovens (blocks), each with 8 trays (threads):

  • To find tray #26 in the whole bakery:
  • It's in oven #3 (blockIdx.x = 3)
  • It's tray #2 inside that oven (threadIdx.x = 2)
  • Formula: 3 × 8 + 2 = 26

Theater seat analogy: You're in seat #26 at a theater with 4 rows (blocks) of 8 seats (threads):

  • Row 3 (blockIdx.x = 3)
  • Seat 2 in that row (threadIdx.x = 2)
  • Global seat: 3 × 8 + 2 = 26

Visualizing Thread Indexing

Let's say we have 4 blocks, each with 8 threads (total = 32 threads):

Global ID:     [0][1][2][3][4][5][6][7] [8][9][10]...[26]...[31]
                └─── Block 0 ───┘ └─── Block 1 ───┘ ... └Block 3┘

threadIdx.x:   [0][1][2][3][4][5][6][7] [0][1][2]...[2]...[7]
                                                      ↑
                                                  Thread 26

blockIdx.x:    [    Block 0    ] [    Block 1    ] ... [  Block 3  ]
                                                         ↑
                                                     Thread 26 is here

For thread with global ID 26:

  • blockIdx.x = 3 (it's in block 3)
  • threadIdx.x = 2 (it's thread #2 within that block)
  • blockDim.x = 8 (each block has 8 threads)
  • Calculation: 3 × 8 + 2 = 26

This means: Thread 26 is responsible for processing array[26] in our vector!


Parallelism in CUDA: What's Automatic?

The GPU automatically handles:

  • Thread scheduling: Assigning threads to physical cores
  • Warp execution: Groups of 32 threads run together
  • Block scheduling: Distributing blocks across streaming multiprocessors
  • Core mapping: Deciding which thread runs on which core

You control:

  • 🎯 How many threads and blocks to launch
  • 🎯 What each thread should compute
  • 🎯 How threads access memory

Example: How Our Vector Addition Uses Threads

In our vector_add_gpu.cu code:

// We launch the kernel with 256 threads per block
int threadsPerBlock = 256;
int blocksPerGrid = (N + 255) / 256;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

Inside the kernel:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
    C[i] = A[i] + B[i];  // Each thread adds ONE element
}

For N = 1,000,000 elements:

  • threadsPerBlock = 256
  • blocksPerGrid = (1,000,000 + 255) / 256 = 3,907 blocks
  • Total threads launched: 3,907 × 256 = 1,000,192 threads
  • Each thread computes one element (threads beyond N do nothing due to the if check)

This means:

  • Thread 0 computes C[0] = A[0] + B[0]
  • Thread 1 computes C[1] = A[1] + B[1]
  • Thread 26 computes C[26] = A[26] + B[26]
  • ...all happening at the same time! ⚡

Why This Makes GPUs So Fast

CPU approach (sequential):

Time = N iterations (one at a time)
For 1,000,000 elements = 1,000,000 time steps

GPU approach (parallel):

Time = (N / number_of_cores) iterations
For 1,000,000 elements with 1000 cores = 1,000 time steps

Speedup = 1000× faster! 🚀


Key Takeaways

  1. CUDA lets you use your GPU for any computation, not just graphics
  2. Threads run in parallel on thousands of GPU cores
  3. Each thread has a unique ID calculated by blockIdx.x * blockDim.x + threadIdx.x
  4. You write one kernel function, but it runs thousands of times simultaneously
  5. The GPU handles all the scheduling automatically – you just define the logic
  6. Memory management is explicit – you control what goes to GPU and when

Thread Blocks and Grids

How CUDA organizes threads:

CUDA doesn't just run thousands of independent threads. It structures threads into blocks, and blocks into grids.

Grid (2D or 3D structure)
  └── Thread Blocks (groups of threads)
        └── Threads (individual workers)

Key properties:

Property | Description
Thread blocks are independent | Each block can run in any order
Threads in a block run on the same SM | All threads in one block execute on one Streaming Multiprocessor (GPU core)
Threads in a block can communicate | They share memory and can synchronize
Blocks build a grid | Multiple blocks form the complete computation

Theater stage analogy: Think of a theater with multiple stages (SMs):

  • Actors = blocks
  • Stages = GPU cores (SMs)
  • Actors are assigned to stages when they're free
  • More space = more actors performing at once
  • The order doesn't matter!

2D and 3D Thread Indexing

For problems that are naturally multi-dimensional (like images or 3D models), CUDA provides 2D and 3D indexing.

CUDA provides the dim3 structure:

struct dim3 {
    unsigned int x, y, z;   // simplified: the real dim3 also provides constructors
};

Example: Processing a 512×512 image

// Host code: launch configuration for a 512×512 image
dim3 grid(16, 16, 1);        // 16×16 = 256 blocks total
dim3 block(32, 32, 1);       // 32×32 = 1024 threads per block

// Device code (inside the kernel): global 2D thread position
int th_x = blockIdx.x * blockDim.x + threadIdx.x;
int th_y = blockIdx.y * blockDim.y + threadIdx.y;

// Now th_x and th_y represent pixel coordinates in the image!

Why calculate global thread ID? Because each thread needs to know which part of data it should work on!
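
To make the 2D case concrete, here is an illustrative kernel that scales every pixel of a grayscale image; the name scalePixels and the image layout are assumptions for this sketch, not part of this project:

// Each thread handles one pixel of a width × height grayscale image.
__global__ void scalePixels(float* img, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height) {
        img[y * width + x] *= factor;                // map 2D coordinates to a 1D offset
    }
}

// Launch for a 512×512 image with 32×32-thread blocks:
// dim3 block(32, 32, 1);
// dim3 grid((512 + 31) / 32, (512 + 31) / 32, 1);   // 16×16 = 256 blocks
// scalePixels<<<grid, block>>>(d_img, 512, 512, 2.0f);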

Built-in CUDA variables:

Variable | What It Means
threadIdx | Thread ID within its block (x, y, z)
blockIdx | Block ID within the grid (x, y, z)
blockDim | Size (number of threads) of a block
gridDim | Size (number of blocks) of the grid

How the GPU Schedules Blocks

Dynamic scheduling:

When you launch a CUDA kernel, you don't control exactly when or where each block runs. The GPU hardware does this automatically!

The process:

  1. You define: Kernel function, number of blocks, threads per block
  2. CUDA assigns: Blocks to SMs (Streaming Multiprocessors)
  3. CUDA assigns: Threads to CUDA cores
  4. No guarantee of order: Blocks can execute in any order
  5. Dynamic allocation: If an SM has more resources, it will run more blocks

Key insight: You don't assign threads to cores manually — CUDA does it automatically! This is what makes CUDA so powerful and easy to use.


Understanding Warps and the Warp Scheduler

What is a warp?

A warp is a group of 32 consecutive threads that execute together. This is a fundamental hardware concept in CUDA.

How warps work:

Block of 256 threads
  ├── Warp 0: Threads 0-31
  ├── Warp 1: Threads 32-63
  ├── Warp 2: Threads 64-95
  ...
  └── Warp 7: Threads 224-255

Warp properties:

  • ✅ All threads in a warp belong to the same block
  • ✅ Threads are placed in warps sequentially
  • ✅ All threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads)
  • ✅ An SM issues instructions warp by warp; many warps can be resident on one SM at the same time, and the hardware switches between them as they become ready
  • ✅ The GPU implements zero-overhead warp scheduling (switching between warps is free!)
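
Inside a kernel you can work out which warp and lane a thread belongs to from its index within the block; a tiny illustrative snippet (warpSize is a CUDA built-in device variable, equal to 32 on current NVIDIA GPUs):

// Inside a kernel: which warp and lane does this thread belong to?
int warp_id = threadIdx.x / warpSize;   // warp number within the block (0, 1, 2, ...)
int lane_id = threadIdx.x % warpSize;   // position within the warp (0-31)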

What is a Warp Scheduler?

The warp scheduler is hardware inside each SM that decides which warp should run next.

How it works:

  1. After a block is assigned to an SM, the SM splits it into warps
  2. The warp scheduler picks which warp to run based on:
    • Data availability
    • Instruction readiness
    • Priority policy
  3. Eligible warps (whose operands are ready) are selected for execution
  4. This happens automatically — you don't control it!

Student class analogy: Think of a warp like 32 students in class all solving the same math problem at the same time. They all follow the same steps, but each works on their own numbers. The teacher (warp scheduler) decides which group of 32 students to help next.


Serial vs Parallel Execution: The Big Difference

Let's see the difference between CPU (serial) and GPU (parallel) with vector addition:

CPU Approach (Serial):

// CPU code: Sequential execution
void vec_add_cpu(int size, float* a, float* b, float* result) {
    for (int i = 0; i < size; i++) {
        result[i] = a[i] + b[i];  // One at a time
    }
}

Timeline:

Time 0: result[0] = a[0] + b[0]
Time 1: result[1] = a[1] + b[1]
Time 2: result[2] = a[2] + b[2]
...
Time N-1: result[N-1] = a[N-1] + b[N-1]

GPU Approach (Parallel):

// GPU kernel: Parallel execution
__global__ void vec_add_gpu(int size, float* a, float* b, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        result[i] = a[i] + b[i];  // All at once!
    }
}

Timeline:

Time 0: ALL results computed simultaneously!
  Thread 0: result[0] = a[0] + b[0]
  Thread 1: result[1] = a[1] + b[1]
  Thread 2: result[2] = a[2] + b[2]
  ...
  Thread N-1: result[N-1] = a[N-1] + b[N-1]

The difference:

  • CPU: N iterations (sequential) = N time steps
  • GPU: 1 iteration (parallel) = 1 time step (ignoring hardware limits)

Students analogy: Imagine 6 students solving 6 pairs of math problems:

  • CPU: One student does all 6 problems (slow)
  • GPU: Each student does 1 problem simultaneously (fast!)

What is a Kernel Function?

A kernel is a GPU function that:

  • ✅ Runs on the GPU
  • ✅ Is called from CPU code
  • ✅ Executes with thousands of threads in parallel

Key properties:

Property | Description
No return value | Kernels can't return values directly
Output via arrays | Results must be written to arrays passed as parameters
Declare thread hierarchy | You specify blocks and threads when calling
Asynchronous execution | The CPU continues immediately without waiting for the GPU

Kernel definition (C++):

__global__ void myKernel(float* input, float* output, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        output[i] = input[i] * 2.0f;  // Example operation
    }
}

Kernel invocation (launch):

int threadsPerBlock = 256;
int blocksPerGrid = (N + 255) / 256;

// Launch the kernel
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, N);

What happens when you launch a kernel:

  1. CPU sends command to GPU: "Run this kernel"
  2. GPU creates blocks and threads according to your specification
  3. CPU continues immediately (asynchronous!)
  4. GPU executes kernel in parallel
  5. CPU can check later if GPU finished (synchronization)
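
A minimal sketch of how the host typically waits for the GPU and checks for errors after an asynchronous launch (generic CUDA runtime calls, reusing the myKernel example above):

myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, N);   // returns immediately

cudaError_t launchErr = cudaGetLastError();        // did the launch itself fail?
cudaError_t syncErr   = cudaDeviceSynchronize();   // block the CPU until the GPU is done

if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    printf("CUDA error: %s\n",
           cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
}

(A cudaMemcpy back to the host also waits for the kernel to finish, which is why the simple vector-addition program works without an explicit synchronize.)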

Real-world example: You want to apply a filter to 1,000 photos:

  • Write a kernel that processes one photo
  • Launch kernel: "Use 100 blocks, 10 threads each"
  • GPU applies filter to all 1,000 photos at the same time!

CUDA Memory Hierarchy

CUDA provides different types of memory, each with different speed, size, and scope.

Memory types and their properties:

Memory Type | Location | Scope | Lifetime | Speed | Size
Register | On-chip (SM) | Single thread | Thread | ⚡⚡⚡ Fastest | Very small
Local | Off-chip (DRAM) | Single thread | Thread | 🐌 Slow | Medium
Shared | On-chip (SM) | All threads in block | Block | ⚡⚡ Very fast | Small (~48 KB)
Global | Off-chip (DRAM) | All threads in grid | Application | 🐌 Slowest | Large (GBs)
Constant | Off-chip (cached) | All threads in grid | Application | ⚡ Fast (cached) | Small (~64 KB)

Variable type qualifiers in CUDA:

__global__ void myKernel() {
    // Automatic variables → Registers (fastest!)
    int threadLocal = threadIdx.x;
    
    // Shared memory → Fast, shared within block
    __shared__ float sharedData[256];
    
    // Device memory → Slow, but large
    // Passed as pointer from host
}

// Global memory → Accessible by all
__device__ float globalVar;

// Constant memory → Read-only, cached
__constant__ float constVar;

Memory hierarchy visualization:

Thread
  └── Registers (private, fastest)
  └── Local Memory (private, slow - overflow from registers)

Block
  └── Shared Memory (shared within block, very fast)

Grid
  └── Global Memory (shared everywhere, large but slow)
  └── Constant Memory (read-only, cached, fast)

Performance tips:

  • Use registers for thread-local variables (automatic)
  • Use shared memory for data shared within a block (manual – see the sketch after these tips)
  • ⚠️ Minimize global memory access (it's slow!)
  • Use constant memory for read-only data (automatically cached)
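
As an example of the shared-memory tip above, here is a hedged sketch of a block-level sum reduction; the kernel name blockSum and the fixed block size of 256 threads are assumptions made for this illustration:

// Each block sums 256 consecutive input elements into one partial sum.
__global__ void blockSum(const float* in, float* partial, int N) {
    __shared__ float tile[256];                       // fast on-chip memory, shared per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < N) ? in[i] : 0.0f;       // stage data in shared memory
    __syncthreads();                                  // wait until the whole tile is loaded

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                              // keep the block in step each round
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];                // one global-memory write per block
}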

Software vs Hardware: Understanding the Layers

It's crucial to understand that grid, block, thread are software concepts, while GPU, SM, CUDA core are hardware components.

Software Layer (Your Code):

Concept | What You Define
Grid | Collection of blocks you launch
Block | Group of threads (e.g., 256 threads)
Thread | Single instance of the kernel function

Hardware Layer (Physical GPU):

Component | Physical Hardware
GPU | The entire graphics card
SM (Streaming Multiprocessor) | A GPU core that executes blocks
CUDA Core | Tiny compute unit that runs one thread at a time

Are they equivalent? NO!

  • ❌ A grid ≠ a GPU
  • ❌ A block ≠ an SM
  • ❌ A thread ≠ a CUDA core

How they cooperate:

  1. You write code defining grids, blocks, and threads
  2. GPU hardware assigns blocks to SMs dynamically
  3. Each SM breaks blocks into warps (32 threads each)
  4. Warp scheduler assigns warps to CUDA cores
  5. CUDA cores execute individual threads

Example with real numbers:

Your code:
- Grid: 1000 blocks
- Block size: 256 threads

Your hardware (e.g., RTX 3080):
- GPU: 1 device
- SMs: 68 streaming multiprocessors
- CUDA cores per SM: 128

What happens:
- GPU assigns multiple blocks to each SM
- Each SM runs one or more blocks at a time
- SM splits each block into 8 warps (256/32)
- Warp scheduler runs warps on CUDA cores
- Result: Massive parallelism!
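
You can query these hardware numbers for your own GPU; a small stand-alone sketch using cudaGetDeviceProperties (compile with nvcc; the printed values depend on your card):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0
    printf("GPU name:              %s\n", prop.name);
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}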

Key insight: The hardware automatically handles all the mapping and scheduling. You just define the logic and structure — CUDA does the rest!


Putting It All Together: The Complete Picture

Let's trace what happens when you run our vector addition program:

1. CPU Code (Host):

vectorAddKernel<<<3907, 256>>>(d_A, d_B, d_C, N);

2. CUDA Creates Structure:

  • Grid: 3,907 blocks
  • Block size: 256 threads per block
  • Total threads: 3,907 × 256 = 1,000,192 threads

3. GPU Hardware Assignment:

  • GPU has (for example) 68 SMs
  • Each SM gets multiple blocks assigned
  • Blocks can run on any available SM

4. Each SM Processes Its Blocks:

  • SM splits each block into warps: 256 threads ÷ 32 = 8 warps
  • Warp scheduler decides which warp runs next
  • CUDA cores execute threads

5. Each Thread Computes:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
    C[i] = A[i] + B[i];
}

6. Memory Access:

  • Each thread reads from global memory: A[i], B[i]
  • Each thread computes the sum: A[i] + B[i]
  • Each thread writes to global memory: C[i]

7. Synchronization:

  • All threads complete
  • GPU signals CPU: "I'm done!"
  • CPU copies result back from GPU

This entire process happens in milliseconds with massive parallelism! 🚀


From Theory to Practice

Now that you understand these concepts, look at our cuda/vector_add_gpu.cu code again:

  • Can you identify where we calculate the global thread ID?
  • Can you see how each thread processes exactly one element?
  • Can you understand why we need the if (i < N) check?
  • Can you see how blocks are organized on the grid?
  • Do you understand what happens to the thread blocks on the SMs?

Try experimenting:

  • Change threadsPerBlock from 256 to 128 or 512 (notice how it affects warps!)
  • Modify the kernel to do vector subtraction instead
  • Add timing code to measure GPU speedup (see the CUDA events sketch after this list)
  • Try processing 2D data (like images) using 2D blocks
  • Experiment with shared memory for block-level communication
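
For the timing experiment, CUDA events are the usual way to measure how long a kernel takes on the GPU; a generic sketch (not taken from this repository):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                                    // mark a point before the kernel
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);                                     // mark a point after the kernel
cudaEventSynchronize(stop);                                // wait for the stop event

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                    // elapsed time in milliseconds
printf("Kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);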

The best way to learn CUDA is by experimenting! 💪
