A simple, beginner-friendly project demonstrating the difference between CPU and GPU computation using vector addition as an example.
Perfect for learning CUDA programming and building your GitHub portfolio! 🚀
This project compares two ways of adding vectors (arrays of numbers):
The CPU (Central Processing Unit):
- The CPU is the "brain" of your computer
- It processes tasks one at a time (sequentially)
- Good for complex logic and decision-making
- Slower for repetitive, parallel tasks

The GPU (Graphics Processing Unit):
- The GPU was originally designed for graphics (gaming, video)
- It has thousands of small cores that can work simultaneously
- Processes many tasks at the same time (in parallel)
- MUCH faster for repetitive tasks like vector addition
Why GPUs are Faster for Parallel Tasks: Imagine you need to paint 1000 identical fences:
- CPU approach: One painter paints all 1000 fences (slow)
- GPU approach: 1000 painters each paint 1 fence at the same time (fast!)
For vector addition, the GPU can compute C[0], C[1], C[2], ... all at once!
This project teaches fundamental CUDA concepts:
| Concept | What It Means |
|---|---|
| CPU vs GPU | Sequential processing vs parallel processing |
| CUDA Kernel | A function that runs on the GPU (__global__) |
| Thread Indexing | Each GPU thread has a unique ID (threadIdx.x, blockIdx.x) |
| GPU Memory Allocation | Reserving memory on the GPU (cudaMalloc) |
| Host ↔ Device Copy | Transferring data between CPU and GPU (cudaMemcpy) |
| Parallel Execution | Thousands of threads running simultaneously |
Vector addition is the "Hello World" of CUDA programming!
It's the simplest example that demonstrates:
- ✅ How to write a CUDA kernel
- ✅ How to manage GPU memory
- ✅ How parallel execution works
- ✅ The performance difference between CPU and GPU
Once you understand this, you can move on to more complex GPU applications like:
- Machine learning and AI
- Image processing
- Scientific simulations
- Cryptocurrency mining
CUDA-Vector-Addition-Beginner/
├── cpu/
│ └── vector_add_cpu.cpp # CPU version (standard C++)
├── cuda/
│ └── vector_add_gpu.cu # GPU version (CUDA C++)
└── README.md # This file
cpu/vector_add_cpu.cpp
- Written in standard C++
- Uses a simple for-loop to add vectors
- Runs entirely on the CPU (sequential)
- Compiled with `g++` (the standard C++ compiler)
cuda/vector_add_gpu.cu
- Written in CUDA C++ (`.cu` extension)
- Uses a CUDA kernel to add vectors in parallel
- Runs on the GPU with thousands of threads
- Compiled with `nvcc` (NVIDIA CUDA Compiler)
Before running this project, you need:
- For CPU version:
  - A C++ compiler like `g++` (usually pre-installed on Linux/Mac; use MinGW on Windows)
- For GPU version:
  - An NVIDIA GPU (any CUDA-capable GPU)
  - CUDA Toolkit installed (Download here)
  - The `nvcc` compiler (comes with the CUDA Toolkit)
The CPU version uses standard C++ and runs on any computer.
Compile:
```bash
g++ cpu/vector_add_cpu.cpp -o vector_add_cpu
```
Run:
```bash
./vector_add_cpu
```
On Windows:
```bash
vector_add_cpu.exe
```
What it does:
- Initializes two vectors A and B
- Adds them sequentially using a for-loop
- Prints the results and execution time
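For reference, here is a minimal sketch of what such a sequential program can look like. This is illustrative only: the actual `cpu/vector_add_cpu.cpp` in this repo may differ, and the initialization `A[i] = i`, `B[i] = 2*i` is an assumption chosen to match the sample output shown later.

```cpp
// Minimal sketch of sequential vector addition (illustrative, not the repo file)
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<float> A(N), B(N), C(N);

    // Assumed initialization: A[i] = i, B[i] = 2*i
    for (int i = 0; i < N; i++) {
        A[i] = static_cast<float>(i);
        B[i] = 2.0f * static_cast<float>(i);
    }

    // Add the vectors one element at a time and measure how long it takes
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "C[1] = " << C[1] << "\n";
    std::cout << "CPU execution time: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms\n";
    return 0;
}
```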
The GPU version uses CUDA and requires an NVIDIA GPU and CUDA Toolkit.
Compile:
```bash
nvcc cuda/vector_add_gpu.cu -o vector_add_gpu
```
Run:
```bash
./vector_add_gpu
```
On Windows:
```bash
vector_add_gpu.exe
```
What it does:
- Initializes two vectors A and B on the CPU
- Copies them to the GPU
- Launches a CUDA kernel with thousands of threads
- Each thread adds ONE pair of elements in parallel
- Copies the result back to the CPU
- Prints the results and execution time
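And here is a minimal sketch of the GPU counterpart, so you can see all of those steps in one place. Again this is illustrative: the actual `cuda/vector_add_gpu.cu` may differ (for example in timing, output, and error checking), and the same initialization as above is assumed.

```cuda
// Minimal sketch of GPU vector addition (illustrative, not the repo file)
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds ONE pair of elements
__global__ void vectorAddKernel(float* A, float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    const int N = 1000000;
    const size_t size = N * sizeof(float);

    // Host (CPU) arrays
    float *h_A = new float[N], *h_B = new float[N], *h_C = new float[N];
    for (int i = 0; i < N; i++) {
        h_A[i] = static_cast<float>(i);
        h_B[i] = 2.0f * static_cast<float>(i);
    }

    // Device (GPU) arrays
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy the inputs to the GPU
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough threads to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back to the CPU
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[1] = %f\n", h_C[1]);

    // Free GPU and CPU memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    delete[] h_A; delete[] h_B; delete[] h_C;
    return 0;
}
```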
1. The `__global__` keyword
```cuda
__global__ void vectorAddKernel(float* A, float* B, float* C, int N)
```
- Marks a function as a CUDA kernel
- Runs on the GPU but can be called from the CPU
- Executed by many threads in parallel
2. Thread Indexing (`threadIdx.x`, `blockIdx.x`)
```cuda
int i = blockIdx.x * blockDim.x + threadIdx.x;
```
- Each thread has a unique ID
- This ID determines which element the thread processes
- Thread 0 handles C[0], Thread 1 handles C[1], etc.
3. `cudaMalloc` (Allocate GPU memory)
```cuda
cudaMalloc((void**)&d_A, size);
```
- Like `malloc()`, but for GPU memory
- Reserves space on the GPU for data
4. `cudaMemcpy` (Copy data between CPU and GPU)
```cuda
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); // CPU → GPU
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // GPU → CPU
```
- Transfers data between the CPU (host) and the GPU (device)
5. Kernel Launch
```cuda
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```
- The `<<<blocks, threads>>>` syntax launches the kernel
- Creates thousands of threads to run in parallel
If you're new to CUDA, follow these steps:
- ✅ Start with the CPU version – Understand basic vector addition
- ✅ Read the CUDA code comments – They explain every line
- ✅ Run both versions – Compare the execution times
- ✅ Experiment – Try changing the vector size (N)
- ✅ Modify the code – Try vector subtraction or multiplication
Next Steps:
- Learn about shared memory optimization
- Try 2D thread blocks for matrix operations
- Explore CUDA libraries like cuBLAS and cuDNN
```text
=== CPU Vector Addition ===
Vector size: 1000000 elements
Initializing vectors...
Performing vector addition on CPU...
Results (first 10 elements):
C[0] = A[0] + B[0] = 0 + 0 = 0
C[1] = A[1] + B[1] = 1 + 2 = 3
C[2] = A[2] + B[2] = 2 + 4 = 6
...
CPU execution time: 2.5 ms
```
```text
=== GPU Vector Addition using CUDA ===
Vector size: 1000000 elements
Initializing vectors on CPU...
Allocating memory on GPU...
Copying data from CPU to GPU...
Launching kernel with 3907 blocks and 256 threads per block...
Total threads: 1000192
Copying result from GPU to CPU...
Results (first 10 elements):
C[0] = A[0] + B[0] = 0 + 0 = 0
C[1] = A[1] + B[1] = 1 + 2 = 3
C[2] = A[2] + B[2] = 2 + 4 = 6
...
GPU execution time: 0.5 ms
```
Note: GPU execution time includes memory transfer overhead. For very large arrays, the GPU speedup becomes much more significant!
- `g++` is for CPU programs (standard C++)
  - Available on all platforms
  - No special hardware required
- `nvcc` is for CUDA programs (GPU programs)
  - Comes with the NVIDIA CUDA Toolkit
  - Requires an NVIDIA GPU
  - The `.cu` file extension indicates CUDA code
For CPU version:
- Any computer with a C++ compiler
For GPU version:
- NVIDIA GPU (CUDA-capable)
- Check compatibility: CUDA GPUs
- CUDA Toolkit installed
- Download: CUDA Toolkit
- Operating System: Windows, Linux, or macOS (with NVIDIA GPU)
"nvcc: command not found"
- CUDA Toolkit is not installed or not in your PATH
- Install CUDA Toolkit and add it to your system PATH
"no CUDA-capable device detected"
- You don't have an NVIDIA GPU
- Your GPU drivers are not installed
- Your GPU doesn't support CUDA
Slow GPU performance
- Normal for small arrays (memory transfer overhead)
- Try increasing N to 10,000,000 to see real speedup
This project is designed for:
- ✅ Absolute beginners in CUDA programming
- ✅ Students learning parallel computing
- ✅ Developers building a GPU programming portfolio
- ✅ Anyone curious about GPU acceleration
No prior CUDA experience required! Just basic C/C++ knowledge.
This project follows best practices for educational code:
- ✨ Very simple – No advanced features
- 💬 Well commented – Every line explained
- 📖 Clear naming – Variables like `h_A` (host) and `d_A` (device)
- 🎯 Focused – One concept at a time
- 🧹 Clean – Proper memory management
Want to extend this project? Try:
- Add error checking for CUDA API calls (see the sketch after this list)
- Implement vector subtraction, multiplication, or dot product
- Compare performance with different vector sizes
- Add a benchmarking script
- Create a 2D matrix addition version
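For the error-checking idea, a common pattern is a small macro that wraps every CUDA API call. This is only a sketch; the name `CUDA_CHECK` is our own and not part of the CUDA API, but `cudaGetErrorString` is a standard runtime function.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: wrap a CUDA API call and abort with a
// readable message (file, line, error string) if it fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage inside main():
//   CUDA_CHECK(cudaMalloc((void**)&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
```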
This project is open source and free to use for learning and portfolio purposes.
Found a bug or have a suggestion? Feel free to open an issue or submit a pull request!
If this project helped you learn CUDA, give it a star on GitHub! ⭐
Happy CUDA Programming! 🎉
Remember: Even the most complex GPU applications start with simple concepts like vector addition. Master this, and you're on your way to building amazing GPU-accelerated software!
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It allows you to use your graphics card (GPU) not just for gaming, but also for:
- 🎬 Video editing (applying filters, effects, rendering)
- 🤖 Artificial Intelligence (training neural networks, deep learning)
- 🔬 Scientific simulations (physics, chemistry, weather modeling)
- 💹 Data processing (financial analysis, big data)
- 🎨 Image processing (photo editing, computer vision)
Why use the GPU? Your GPU has thousands of small cores that can work in parallel, making it incredibly fast for repetitive tasks. Instead of one CPU core doing 1000 tasks sequentially, 1000 GPU threads can do them simultaneously!
CUDA is both:
- A programming model: How to write GPU-parallel code
- A toolkit: Software that lets your computer understand and run that code on the GPU
Think of it like this:
- Your CPU is like a skilled manager who handles complex decisions
- Your GPU is like a huge team of workers who can all do simple tasks at the same time
CUDA includes several components that work together:
| Component | What It Does |
|---|---|
| Driver | Low-level software that lets your computer talk to the GPU |
| Toolkit | Collection of tools including compiler, debugger, libraries |
| nvcc Compiler | Special compiler that turns CUDA C++ code (.cu files) into GPU machine code |
| Libraries | Pre-built functions for common GPU tasks (math, deep learning, etc.) |
| Numba (Python) | Tool to write GPU code in Python without learning C++ |
Two ways to use CUDA:
- C++ + nvcc: Full control and maximum performance (what this project uses)
- Python + Numba: Easy GPU programming without learning C++ (uses the `@cuda.jit` decorator)
When you run a CUDA program, here's what happens step by step:
1. CPU starts the program
↓
2. CPU allocates memory on both CPU (host) and GPU (device)
↓
3. CPU sends data to the GPU (e.g., vectors A and B)
↓
4. CPU launches the kernel (tells GPU: "Run this function!")
↓
5. GPU executes the kernel with thousands of threads in parallel
↓
6. GPU sends results back to CPU (e.g., vector C)
↓
7. Program frees all memory and terminates
Real-world example: Editing a 4K video frame
- CPU loads the video file
- CPU sends one frame to GPU
- GPU applies a filter (sharpen, color correction) super fast
- GPU sends the filtered frame back
- Repeat for all frames
What is a thread? A thread is a single instance of your kernel function running on the GPU. When you launch a kernel, thousands of threads run the same code but on different data.
Thread Hierarchy: CUDA organizes threads in a 3-dimensional structure:
Grid (entire computation)
└── Blocks (groups of threads)
└── Threads (individual workers)
Why 3D? This is convenient for naturally multi-dimensional problems:
- 1D: Processing an array (like our vector addition)
- 2D: Processing an image (width × height)
- 3D: Processing video (width × height × frames)
Every CUDA thread has access to special variables that tell it "who it is":
| Variable | What It Means | Range |
|---|---|---|
| `threadIdx.x` | Thread ID within its block | 0 to (blockDim.x - 1) |
| `blockIdx.x` | Block ID in the grid | 0 to (gridDim.x - 1) |
| `blockDim.x` | Total threads per block | Set by the programmer |

These also exist for the `.y` and `.z` dimensions!
The most important calculation in CUDA is finding each thread's global ID:
```cuda
int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;
```
What this means:
- `blockIdx.x`: Which block am I in?
- `blockDim.x`: How many threads per block?
- `threadIdx.x`: Which thread am I within my block?
- Result: My unique position in the entire computation
Bakery analogy: Imagine a bakery with 4 ovens (blocks), each with 8 trays (threads):
- To find tray #26 in the whole bakery:
- It's in oven #3 (blockIdx.x = 3)
- It's tray #2 inside that oven (threadIdx.x = 2)
- Formula: `3 × 8 + 2 = 26` ✅
Theater seat analogy: You're in seat #26 at a theater with 4 rows (blocks) of 8 seats (threads):
- Row 3 (blockIdx.x = 3)
- Seat 2 in that row (threadIdx.x = 2)
- Global seat: `3 × 8 + 2 = 26` ✅
Let's say we have 4 blocks, each with 8 threads (total = 32 threads):
```text
Global ID:   [0][1][2][3][4][5][6][7] [8][9][10]...[26]...[31]
             └─── Block 0 ───┘ └─── Block 1 ───┘ ... └Block 3┘
threadIdx.x: [0][1][2][3][4][5][6][7] [0][1][2]...[2]...[7]
                                                   ↑
                                               Thread 26
blockIdx.x:  [ Block 0 ] [ Block 1 ] ... [ Block 3 ]
                                              ↑
                                     Thread 26 is here
```
For thread with global ID 26:
- `blockIdx.x = 3` (it's in block 3)
- `threadIdx.x = 2` (it's thread #2 within that block)
- `blockDim.x = 8` (each block has 8 threads)
- Calculation: `3 × 8 + 2 = 26` ✅
This means: Thread 26 is responsible for processing array[26] in our vector!
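If you want to see these IDs for yourself, a tiny throwaway program (our own example, not part of this project) can print them from the GPU using the device-side `printf`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its block ID, local thread ID, and computed global ID.
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global ID %d\n",
           blockIdx.x, threadIdx.x, globalId);
}

int main() {
    whoAmI<<<4, 8>>>();       // 4 blocks of 8 threads = 32 threads (like the example above)
    cudaDeviceSynchronize();  // wait so the device printf output actually appears
    return 0;
}
```

Notice that the lines will not necessarily come out in order, which is a nice reminder that blocks and threads run in whatever order the GPU chooses.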
The GPU automatically handles:
- ✅ Thread scheduling: Assigning threads to physical cores
- ✅ Warp execution: Groups of 32 threads run together
- ✅ Block scheduling: Distributing blocks across streaming multiprocessors
- ✅ Core mapping: Deciding which thread runs on which core
You control:
- 🎯 How many threads and blocks to launch
- 🎯 What each thread should compute
- 🎯 How threads access memory
In our `vector_add_gpu.cu` code:
```cuda
// We launch the kernel with 256 threads per block
int threadsPerBlock = 256;
int blocksPerGrid = (N + 255) / 256;
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
```
Inside the kernel:
```cuda
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
    C[i] = A[i] + B[i];  // Each thread adds ONE element
}
```
For N = 1,000,000 elements:
- `threadsPerBlock = 256`
- `blocksPerGrid = (1,000,000 + 255) / 256 = 3,907` blocks
- Total threads launched: 3,907 × 256 = 1,000,192 threads
- Each thread computes one element (threads beyond N do nothing due to the `if` check)
This means:
- Thread 0 computes `C[0] = A[0] + B[0]`
- Thread 1 computes `C[1] = A[1] + B[1]`
- Thread 26 computes `C[26] = A[26] + B[26]`
- ...all happening at the same time! ⚡
CPU approach (sequential):
Time = N iterations (one at a time)
For 1,000,000 elements = 1,000,000 time steps
GPU approach (parallel):
Time = (N / number_of_cores) iterations
For 1,000,000 elements with 1000 cores = 1,000 time steps
Speedup = 1000× faster! 🚀
- ✅ CUDA lets you use your GPU for any computation, not just graphics
- ✅ Threads run in parallel on thousands of GPU cores
- ✅ Each thread has a unique ID calculated by `blockIdx.x * blockDim.x + threadIdx.x`
- ✅ You write one kernel function, but it runs thousands of times simultaneously
- ✅ The GPU handles all the scheduling automatically – you just define the logic
- ✅ Memory management is explicit – you control what goes to GPU and when
How CUDA organizes threads:
CUDA doesn't just run thousands of independent threads. It structures threads into blocks, and blocks into grids.
Grid (1D, 2D, or 3D structure)
└── Thread Blocks (groups of threads)
└── Threads (individual workers)
Key properties:
| Property | Description |
|---|---|
| Thread blocks are independent | Each block can run in any order |
| Threads in a block run on the same SM | All threads in one block execute on one Streaming Multiprocessor (GPU core) |
| Threads in a block can communicate | They share memory and can synchronize |
| Blocks build a grid | Multiple blocks form the complete computation |
Theater stage analogy: Think of a theater with multiple stages (SMs):
- Actors = blocks
- Stages = GPU cores (SMs)
- Actors are assigned to stages when they're free
- More space = more actors performing at once
- The order doesn't matter!
For problems that are naturally multi-dimensional (like images or 3D models), CUDA provides 2D and 3D indexing.
CUDA provides the `dim3` structure:
```cuda
typedef struct {
    int x; int y; int z;
} dim3;
```
Example: Processing a 512×512 image
```cuda
// Grid configuration: 16×16 blocks
dim3 gridDim(16, 16, 1);   // 16×16 = 256 blocks total

// Block configuration: 32×32 threads per block
dim3 blockDim(32, 32, 1);  // 32×32 = 1024 threads per block

// Calculate global 2D thread position
int th_x = blockIdx.x * blockDim.x + threadIdx.x;
int th_y = blockIdx.y * blockDim.y + threadIdx.y;

// Now th_x and th_y represent pixel coordinates in the image!
```
Why calculate global thread ID? Because each thread needs to know which part of data it should work on!
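Putting the 2D indexing together, here is a sketch of a kernel that could brighten every pixel of a grayscale image. It is our own illustrative example, not a file in this repo; `brightenKernel` and `d_image` are assumed names.

```cuda
// Each thread handles one pixel (x, y) of a width×height image.
__global__ void brightenKernel(float* image, int width, int height, float amount) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int idx = y * width + x;   // 2D coordinates -> 1D array index
        image[idx] += amount;      // brighten this pixel
    }
}

// Host-side launch (assumes d_image was allocated with cudaMalloc and
// filled with cudaMemcpy, just like d_A in the vector example):
//   dim3 threads(32, 32, 1);
//   dim3 blocks((width + 31) / 32, (height + 31) / 32, 1);
//   brightenKernel<<<blocks, threads>>>(d_image, width, height, 0.1f);
```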
Built-in CUDA variables:
| Variable | What It Means |
|---|---|
| `threadIdx` | Thread ID within its block (x, y, z) |
| `blockIdx` | Block ID within the grid (x, y, z) |
| `blockDim` | Size (number of threads) of a block |
| `gridDim` | Size (number of blocks) of the grid |
Dynamic scheduling:
When you launch a CUDA kernel, you don't control exactly when or where each block runs. The GPU hardware does this automatically!
The process:
- You define: Kernel function, number of blocks, threads per block
- CUDA assigns: Blocks to SMs (Streaming Multiprocessors)
- CUDA assigns: Threads to CUDA cores
- No guarantee of order: Blocks can execute in any order
- Dynamic allocation: If an SM has more resources, it will run more blocks
Key insight: You don't assign threads to cores manually — CUDA does it automatically! This is what makes CUDA so powerful and easy to use.
What is a warp?
A warp is a group of 32 consecutive threads that execute together. This is a fundamental hardware concept in CUDA.
How warps work:
Block of 256 threads
├── Warp 0: Threads 0-31
├── Warp 1: Threads 32-63
├── Warp 2: Threads 64-95
...
└── Warp 7: Threads 224-255
Warp properties:
- ✅ All threads in a warp belong to the same block
- ✅ Threads are placed in warps sequentially
- ✅ All threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads)
- ✅ Within an SM, each warp scheduler issues instructions from one warp at a time
- ✅ The GPU implements zero-overhead warp scheduling (switching between warps is free!)
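As a small illustration (our own snippet, not from this repo), each thread can work out which warp and lane it belongs to using the built-in `warpSize` constant:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes which warp (within its block) and which lane it is.
__global__ void warpInfo() {
    int warpInBlock = threadIdx.x / warpSize;  // warp number inside this block
    int lane = threadIdx.x % warpSize;         // position inside the warp (0-31)

    // Print only the first thread of each warp to keep the output short
    if (lane == 0) {
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpInBlock, threadIdx.x);
    }
}

int main() {
    warpInfo<<<2, 256>>>();   // 2 blocks of 256 threads = 8 warps per block
    cudaDeviceSynchronize();  // wait for the device printf output
    return 0;
}
```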
What is a Warp Scheduler?
The warp scheduler is hardware inside each SM that decides which warp should run next.
How it works:
- After a block is assigned to an SM, the SM splits it into warps
- The warp scheduler picks which warp to run based on:
- Data availability
- Instruction readiness
- Priority policy
- Eligible warps (whose operands are ready) are selected for execution
- This happens automatically — you don't control it!
Student class analogy: Think of a warp like 32 students in class all solving the same math problem at the same time. They all follow the same steps, but each works on their own numbers. The teacher (warp scheduler) decides which group of 32 students to help next.
Let's see the difference between CPU (serial) and GPU (parallel) with vector addition:
CPU Approach (Serial):
```cpp
// CPU code: Sequential execution
void vec_add_cpu(int size, float* a, float* b, float* result) {
    for (int i = 0; i < size; i++) {
        result[i] = a[i] + b[i];  // One at a time
    }
}
```
Timeline:
Time 0: result[0] = a[0] + b[0]
Time 1: result[1] = a[1] + b[1]
Time 2: result[2] = a[2] + b[2]
...
Time N-1: result[N-1] = a[N-1] + b[N-1]
GPU Approach (Parallel):
```cuda
// GPU kernel: Parallel execution
__global__ void vec_add_gpu(int size, float* a, float* b, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        result[i] = a[i] + b[i];  // All at once!
    }
}
```
Timeline:
Time 0: ALL results computed simultaneously!
Thread 0: result[0] = a[0] + b[0]
Thread 1: result[1] = a[1] + b[1]
Thread 2: result[2] = a[2] + b[2]
...
Thread N-1: result[N-1] = a[N-1] + b[N-1]
The difference:
- CPU: N iterations (sequential) = N time steps
- GPU: 1 iteration (parallel) = 1 time step (ignoring hardware limits)
Students analogy: Imagine 6 students solving 6 pairs of math problems:
- CPU: One student does all 6 problems (slow)
- GPU: Each student does 1 problem simultaneously (fast!)
A kernel is a GPU function that:
- ✅ Runs on the GPU
- ✅ Is called from CPU code
- ✅ Executes with thousands of threads in parallel
Key properties:
| Property | Description |
|---|---|
| No return value | Kernels can't return values directly |
| Output via arrays | Results must be written to arrays passed as parameters |
| Declare thread hierarchy | You specify blocks and threads when calling |
| Asynchronous execution | CPU continues immediately without waiting for GPU |
Kernel definition (C++):
```cuda
__global__ void myKernel(float* input, float* output, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        output[i] = input[i] * 2.0f;  // Example operation
    }
}
```
Kernel invocation (launch):
```cuda
int threadsPerBlock = 256;
int blocksPerGrid = (N + 255) / 256;

// Launch the kernel
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, N);
```
What happens when you launch a kernel:
- CPU sends command to GPU: "Run this kernel"
- GPU creates blocks and threads according to your specification
- CPU continues immediately (asynchronous!)
- GPU executes kernel in parallel
- CPU can check later if GPU finished (synchronization)
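A sketch of what step 5 looks like in host code, continuing the `myKernel` example above (`h_output` and `size` are assumed to have been set up like in the earlier memory snippets):

```cuda
// The launch returns immediately; the CPU is free to do other work here.
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, N);

// ... CPU work that does not depend on the GPU result ...

// Block the CPU until the GPU has finished all queued work.
cudaDeviceSynchronize();

// Now it is safe to copy the result back.
// (A cudaMemcpy on the default stream also waits for the kernel to finish,
//  which is why the simple examples in this project work without an
//  explicit synchronize.)
cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);
```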
Real-world example: You want to apply a filter to 1,000 photos:
- Write a kernel that processes one photo
- Launch kernel: "Use 100 blocks, 10 threads each"
- GPU applies filter to all 1,000 photos at the same time!
CUDA provides different types of memory, each with different speed, size, and scope.
Memory types and their properties:
| Memory Type | Location | Scope | Lifetime | Speed | Size |
|---|---|---|---|---|---|
| Register | On-chip (SM) | Single thread | Thread | ⚡⚡⚡ Fastest | Very small |
| Local | Off-chip (DRAM) | Single thread | Thread | 🐌 Slow | Medium |
| Shared | On-chip (SM) | All threads in block | Block | ⚡⚡ Very fast | Small (~48KB) |
| Global | Off-chip (DRAM) | All threads in grid | Application | 🐌 Slowest | Large (GBs) |
| Constant | Off-chip (cached) | All threads in grid | Application | ⚡ Fast (cached) | Small (~64KB) |
Variable type qualifiers in CUDA:
```cuda
__global__ void myKernel() {
    // Automatic variables → Registers (fastest!)
    int threadLocal = threadIdx.x;

    // Shared memory → Fast, shared within block
    __shared__ float sharedData[256];

    // Device memory → Slow, but large
    // Passed as pointer from host
}

// Global memory → Accessible by all
__device__ float globalVar;

// Constant memory → Read-only, cached
__constant__ float constVar;
```
Memory hierarchy visualization:
Thread
└── Registers (private, fastest)
└── Local Memory (private, slow - overflow from registers)
Block
└── Shared Memory (shared within block, very fast)
Grid
└── Global Memory (shared everywhere, large but slow)
└── Constant Memory (read-only, cached, fast)
Performance tips:
- ✅ Use registers for thread-local variables (automatic)
- ✅ Use shared memory for data shared within a block (manual)
- ⚠️ Minimize global memory access (it's slow!)
- ✅ Use constant memory for read-only data (automatically cached)
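To make the shared-memory tip concrete, here is a small sketch (our own example, assuming blocks of 256 threads, not code from this repo) where each block first stages its slice of the input in shared memory, then every thread averages its element with its left neighbour:

```cuda
// Each block copies its chunk of `in` into fast shared memory once,
// then threads read from shared memory instead of going back to
// slow global memory for the neighbour value.
__global__ void neighborAverage(const float* in, float* out, int N) {
    __shared__ float tile[256];        // one slot per thread (assumes blockDim.x == 256)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        tile[threadIdx.x] = in[i];     // one global read per thread
    }
    __syncthreads();                   // wait until the whole tile is loaded

    if (i < N) {
        // At the block edge we simply reuse our own value to keep the sketch simple.
        float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        out[i] = 0.5f * (left + tile[threadIdx.x]);
    }
}
```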
It's crucial to understand that grid, block, thread are software concepts, while GPU, SM, CUDA core are hardware components.
Software Layer (Your Code):
| Concept | What You Define |
|---|---|
| Grid | Collection of blocks you launch |
| Block | Group of threads (e.g., 256 threads) |
| Thread | Single instance of kernel function |
Hardware Layer (Physical GPU):
| Component | Physical Hardware |
|---|---|
| GPU | The entire graphics card |
| SM (Streaming Multiprocessor) | A GPU core that executes blocks |
| CUDA Core | Tiny compute unit that runs one thread at a time |
Are they equivalent? NO!
- ❌ A grid ≠ a GPU
- ❌ A block ≠ an SM
- ❌ A thread ≠ a CUDA core
How they cooperate:
- You write code defining grids, blocks, and threads
- GPU hardware assigns blocks to SMs dynamically
- Each SM breaks blocks into warps (32 threads each)
- Warp scheduler assigns warps to CUDA cores
- CUDA cores execute individual threads
Example with real numbers:
Your code:
- Grid: 1000 blocks
- Block size: 256 threads
Your hardware (e.g., RTX 3080):
- GPU: 1 device
- SMs: 68 streaming multiprocessors
- CUDA cores per SM: 128
What happens:
- GPU assigns multiple blocks to each SM
- Each SM runs blocks one (or more) at a time
- SM splits each block into 8 warps (256/32)
- Warp scheduler runs warps on CUDA cores
- Result: Massive parallelism!
Key insight: The hardware automatically handles all the mapping and scheduling. You just define the logic and structure — CUDA does the rest!
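You can look up these hardware numbers for your own card with the standard `cudaGetDeviceProperties` runtime call; the printed values will of course depend on your GPU:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

    printf("GPU name:            %s\n", prop.name);
    printf("SM count:            %d\n", prop.multiProcessorCount);
    printf("Warp size:           %d\n", prop.warpSize);
    printf("Max threads / block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```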
Let's trace what happens when you run our vector addition program:
1. CPU Code (Host):
```cuda
vectorAddKernel<<<3907, 256>>>(d_A, d_B, d_C, N);
```
2. CUDA Creates Structure:
- Grid: 3,907 blocks
- Block size: 256 threads per block
- Total threads: 3,907 × 256 = 1,000,192 threads
3. GPU Hardware Assignment:
- GPU has (for example) 68 SMs
- Each SM gets multiple blocks assigned
- Blocks can run on any available SM
4. Each SM Processes Its Blocks:
- SM splits each block into warps: 256 threads ÷ 32 = 8 warps
- Warp scheduler decides which warp runs next
- CUDA cores execute threads
5. Each Thread Computes:
```cuda
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
    C[i] = A[i] + B[i];
}
```
6. Memory Access:
- Each thread reads from global memory: `A[i]`, `B[i]`
- Each thread computes the sum
- Each thread writes to global memory: `C[i]`
7. Synchronization:
- All threads complete
- GPU signals CPU: "I'm done!"
- CPU copies result back from GPU
This entire process happens in milliseconds with massive parallelism! 🚀
Now that you understand these concepts, look at our `cuda/vector_add_gpu.cu` code again:
- Can you identify where we calculate the global thread ID?
- Can you see how each thread processes exactly one element?
- Can you understand why we need the `if (i < N)` check?
- Can you see how blocks are organized on the grid?
- Do you understand what happens to the thread blocks on the SMs?
Try experimenting:
- Change `threadsPerBlock` from 256 to 128 or 512 (notice how it affects warps!)
- Modify the kernel to do vector subtraction instead
- Add timing code to measure GPU speedup (see the sketch after this list)
- Try processing 2D data (like images) using 2D blocks
- Experiment with shared memory for block-level communication
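For the timing experiment, CUDA events are the usual tool. The sketch below is meant to be dropped into the GPU program's `main` around the kernel launch; the variable names follow the earlier examples and are assumed to already exist.

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                  // mark the start on the GPU timeline
vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);                   // mark the end on the GPU timeline

cudaEventSynchronize(stop);              // wait until the stop event has actually happened

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Note that this measures only the kernel itself; if you want the "end-to-end" time including `cudaMemcpy`, record the events around those calls as well.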
The best way to learn CUDA is by experimenting! 💪