Python Numba Vector Addition – Beginner Project

A beginner-friendly introduction to GPU programming in Python using Numba CUDA. This project demonstrates the difference between CPU and GPU computation using a simple vector addition example.

Perfect for absolute beginners who want to:

  • Learn GPU programming fundamentals
  • Understand parallel computing concepts
  • Build a portfolio project for GitHub
  • Explore NVIDIA CUDA with Python

📚 Introduction

What is CPU Computation?

The CPU (Central Processing Unit) is the main processor in your computer. When you run normal Python code, it executes on the CPU. The CPU is great for general-purpose tasks, but a single Python program executes its instructions largely sequentially, one at a time.

Think of it like a single cashier at a supermarket checkout – they can only serve one customer at a time.

What is GPU Computation?

The GPU (Graphics Processing Unit) is a specialized processor originally designed for graphics. Modern GPUs have thousands of cores that can work in parallel (simultaneously).

Think of it like having hundreds of cashiers at the supermarket – they can serve many customers at the same time!

Why are GPUs Faster for Parallel Tasks?

When you need to perform the same operation on large amounts of data, GPUs excel because they can process many data elements simultaneously. This is called parallel processing.

For example:

  • Adding two arrays with 1,000,000 elements
  • Processing pixels in an image
  • Training machine learning models
  • Scientific simulations

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It's a set of tools and APIs that allows you to write programs that run on NVIDIA GPUs for general-purpose computing, not just graphics.

Key Points about CUDA:

  • 🎯 It's a GPU programming model made by NVIDIA
  • 🚀 Enables general-purpose GPU computing (not just graphics)
  • 💻 Traditionally uses C/C++ with special CUDA keywords
  • ⚡ Allows you to run thousands of tasks (threads) at the same time
  • 🔧 Gives you explicit control over GPU memory management
  • 📦 Includes a complete toolkit with compilers, libraries, and debugging tools

Example: Imagine you're training an AI model for face detection. It needs millions of math operations. Using only the CPU would be slow (sequential processing). But with CUDA, you can send these operations to the GPU, which processes many of them simultaneously, making it 10-100x faster!

CUDA Components

CUDA is not just one tool—it's a complete ecosystem:

Component  | Description                              | Purpose
Driver     | Low-level software that controls the GPU | Lets your computer communicate with the GPU hardware
Toolkit    | Complete development package             | Includes the nvcc compiler (for CUDA C/C++), libraries, debugging tools, and IDE plugins
CUDA C/C++ | Extended C/C++ with GPU keywords         | Write GPU code with full control (requires learning C++)
Numba      | Just-in-time compiler for Python         | Write GPU code in Python without C++ (easier for Python developers!)

The CUDA Workflow (High-Level Process):

  1. CPU (Host) starts the program and initializes the GPU device
  2. CPU allocates memory on both CPU (host) and GPU (device)
  3. CPU copies data from host memory to device memory
  4. CPU launches kernel (GPU function) with specified threads/blocks
  5. GPU executes the kernel across thousands of threads in parallel
  6. Results are copied back from device memory to host memory
  7. CPU continues processing or repeats steps 3-6 as needed
  8. Program cleans up memory and terminates
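
As a rough illustration, here is a minimal Numba sketch of the same workflow. The kernel body, array size, and variable names are placeholders for illustration, not this project's actual code:

import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(x, out):                       # steps 4-5: executed by many GPU threads
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.size:
        out[i] = 2.0 * x[i]

x = np.arange(1_000_000, dtype=np.float32)      # step 2: allocate host memory
x_dev = cuda.to_device(x)                       # step 3: copy host -> device
out_dev = cuda.device_array_like(x_dev)         # step 2: allocate device memory

threads = 256
blocks = (x.size + threads - 1) // threads
scale_kernel[blocks, threads](x_dev, out_dev)   # step 4: launch the kernel
out = out_dev.copy_to_host()                    # step 6: copy device -> host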

How Does Numba Help?

Numba is a Python library that compiles Python code to run on NVIDIA GPUs using CUDA. Instead of learning complex C++ CUDA programming, you can write GPU code in Python!

Key advantages:

  • ✅ Write GPU code in Python (no C/C++ required)
  • ✅ Easy to learn for Python developers
  • ✅ Great performance boost for numerical computations
  • ✅ Works with NumPy arrays
  • ✅ Just-in-time (JIT) compilation—code is compiled automatically when you run it
  • ✅ Automatic type inference—Numba figures out data types for you

How Numba Execution Works

When you use Numba's @cuda.jit decorator, here's what happens behind the scenes:

Your Python Code (@cuda.jit function)
          ↓
   Python Bytecode
          ↓
   Bytecode Analysis
          ↓
     Numba IR (Intermediate Representation)
          ↓
   Type Inference (figures out data types)
          ↓
   IR Optimization
          ↓
     LLVM IR (Low-Level Virtual Machine)
          ↓
   LLVM JIT Compilation
          ↓
   Machine Code (GPU binary)
          ↓
   Execute on GPU!

What this means: You write normal Python code, add @cuda.jit, and Numba automatically transforms it into GPU machine code. No manual compilation needed!

Example:

from numba import cuda

@cuda.jit
def my_gpu_function(a, b, c):
    # Your Python code here (runs on the GPU)
    c[0] = a[0] + b[0]

Numba reads this, compiles it to GPU code, and executes it—all automatically!
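
For completeness, launching that kernel might look like the snippet below (launch syntax is covered in detail later; the arrays here are tiny placeholders):

import numpy as np
from numba import cuda

a_dev = cuda.to_device(np.array([1.0]))
b_dev = cuda.to_device(np.array([2.0]))
c_dev = cuda.device_array(1)
my_gpu_function[1, 1](a_dev, b_dev, c_dev)   # 1 block with 1 thread is enough here
print(c_dev.copy_to_host())                  # [3.]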

What Python Code Works with Numba?

Numba doesn't support all Python features on the GPU. Here's what DOES work:

Supported on GPU:

  • if, elif, else statements
  • for and while loops
  • Basic math operators: +, -, *, /, **, %
  • Math module functions: math.sin(), math.cos(), math.sqrt(), etc.
  • Tuples
  • NumPy arrays

NOT Supported on GPU:

  • Strings (text operations)
  • Lists (use NumPy arrays instead)
  • Dictionaries
  • File I/O operations
  • Print statements (limited support)
  • Most Python libraries

Example:

# ✅ This works inside a Numba CUDA kernel (with `import math` at module level):
for i in range(1000):
    result[i] = math.sqrt(a[i] * a[i] + b[i] * b[i])

# ❌ This does NOT work in Numba CUDA:
message = "Hello"  # Strings not supported
my_list = [1, 2, 3]  # Lists not supported

🎯 Concepts Demonstrated

This project teaches the following concepts:

Concept                | Description
CPU vs GPU Computation | Compare sequential CPU processing with parallel GPU processing
Numba CUDA Kernel      | Learn to write GPU functions using the @cuda.jit decorator
Thread Indexing        | Understand cuda.threadIdx.x and how threads map to data
Memory Management      | Copy data between host (CPU) and device (GPU) memory
Parallel Execution     | See how thousands of GPU threads work simultaneously
Thread Hierarchy       | Understand how threads, blocks, and grids are organized
Warps                  | Learn how the GPU groups 32 threads together for execution

🧠 Deep Dive: CUDA Concepts

To truly understand GPU programming, you need to learn some core CUDA concepts. Don't worry—we'll explain everything with simple analogies!

🧵 Threads: The Basic Unit of Execution

What is a Thread?

A thread is the smallest unit of execution in CUDA. Think of it as one tiny worker doing one small task.

  • Each thread runs independently
  • Each thread executes the same code (the kernel function)
  • Each thread works on different data
  • Thousands of threads run simultaneously on the GPU

Analogy: Imagine 1,000 students all solving the same type of math problem (a + b), but each student has different numbers. Each student is like one thread.

Example from our project:

# Thread 0: computes c[0] = a[0] + b[0]
# Thread 1: computes c[1] = a[1] + b[1]
# Thread 2: computes c[2] = a[2] + b[2]
# ... all happening at the same time!

📦 Thread Blocks: Organizing Threads

What is a Block?

Threads are organized into blocks. A block is a group of threads that:

  • Run on the same SM (Streaming Multiprocessor—a physical GPU core)
  • Can communicate and share memory with each other
  • Can be synchronized (wait for each other)

Typical block sizes: 32, 64, 128, 256, 512, or 1024 threads per block.

Analogy: Think of a block as a classroom with 32 students (threads). All students in one classroom can work together and share resources.

🗂️ Grid: The Complete Collection

What is a Grid?

A grid is a collection of blocks. When you launch a kernel, you define:

  • How many blocks you want
  • How many threads per block

The GPU then organizes and executes all these threads.

Analogy: If a block is a classroom, a grid is the entire school with many classrooms.

Visual Hierarchy:

GPU Device
   └─ Grid (launched by CPU)
       ├─ Block 0
       │   ├─ Thread 0
       │   ├─ Thread 1
       │   └─ ... Thread 255
       ├─ Block 1
       │   ├─ Thread 0
       │   ├─ Thread 1
       │   └─ ... Thread 255
       └─ Block 2...

🎯 Thread Identification: Finding Your Place

Every thread needs to know which data it should work on. CUDA provides built-in variables:

Variable         | What it tells you
cuda.threadIdx.x | Thread's position within its block (0 to blockDim.x - 1)
cuda.blockIdx.x  | Block's position within the grid (0 to gridDim.x - 1)
cuda.blockDim.x  | Total number of threads in one block
cuda.gridDim.x   | Total number of blocks in the grid

Computing Global Thread ID:

Each thread calculates its unique global index:

global_id = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x

Example with Real Numbers:

Imagine 4 blocks, each with 8 threads (total 32 threads):

Global ID:  0  1  2  3  4  5  6  7 | 8  9 10 11 12 13 14 15 | 16 17 18 19 20 21 22 23 | 24 25 26 27 28 29 30 31
Thread ID:  0  1  2  3  4  5  6  7 | 0  1  2  3  4  5  6  7 | 0  1  2  3  4  5  6  7 | 0  1  2  3  4  5  6  7
Block ID:         Block 0          |        Block 1         |        Block 2         |        Block 3

Finding thread #26:

  • It's in Block 3 (blockIdx.x = 3)
  • It's thread #2 within that block (threadIdx.x = 2)
  • Global ID: 3 × 8 + 2 = 26 ✓

Theater Analogy: You have seat #26 in a theater with 4 rows (blocks), each row having 8 seats (threads). Your seat is: Row 3 × 8 seats + seat 2 = seat 26.
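
To make the mapping concrete, here is a small, hedged sketch in which every thread writes its own global index into an output array (the kernel and names are illustrative, not part of this project):

import numpy as np
from numba import cuda

@cuda.jit
def write_global_id(out):
    gid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if gid < out.size:
        out[gid] = gid

out_dev = cuda.device_array(32, dtype=np.int32)
write_global_id[4, 8](out_dev)        # 4 blocks x 8 threads = 32 threads
print(out_dev.copy_to_host())         # [ 0  1  2 ... 30 31]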

🌊 Warps: How GPU Actually Executes Threads

What is a Warp?

A warp is a group of 32 consecutive threads that execute together. This is a hardware concept.

Key Facts:

  • The GPU hardware automatically divides blocks into warps of 32 threads
  • All threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads)
  • An SM issues instructions one warp at a time per warp scheduler, while many warps stay resident and ready to run
  • Switching between resident warps has essentially zero overhead, which lets the SM hide memory latency

Why 32? This is a hardware design choice by NVIDIA. All CUDA-capable GPUs use warps of 32 threads.

Analogy: Think of 32 students in a classroom all solving the same math problem at exactly the same time. The teacher (warp scheduler) gives one instruction, and all 32 students follow it simultaneously.

Important: You don't explicitly create warps in your code. The GPU automatically handles this. But understanding warps helps you write more efficient code.

🏗️ Software vs Hardware: How They Map

This is crucial to understand:

Software (What You Define) | Hardware (Physical GPU)
Grid                       | The entire GPU device
Block                      | Assigned to one SM (Streaming Multiprocessor)
Thread                     | Runs on one CUDA core
Warp                       | Group of 32 threads scheduled together by the SM

Important Notes:

  • You define grids and blocks in your code
  • CUDA/GPU automatically assigns blocks to SMs
  • CUDA/GPU automatically assigns threads to CUDA cores
  • You don't manually assign threads to specific cores
  • Blocks are scheduled dynamically—no guaranteed execution order
  • One SM can run multiple blocks if it has enough resources

🎭 The Complete Flow: From Code to Execution

What You Do (Software):

  1. Write a kernel function with @cuda.jit
  2. Define: threads_per_block = 256
  3. Define: blocks_per_grid = 100
  4. Launch: my_kernel[blocks_per_grid, threads_per_block](...)

What CUDA Does Automatically (Hardware):

  1. Creates 100 blocks × 256 threads = 25,600 threads total
  2. Assigns each block to an available SM (GPU core)
  3. Divides each block into warps (256 ÷ 32 = 8 warps per block)
  4. The warp scheduler in each SM decides which warp runs next
  5. Threads in the active warp execute on CUDA cores
  6. Process repeats until all blocks finish

💾 CUDA Memory Hierarchy

Understanding GPU memory is crucial for writing efficient CUDA code.

Memory Types

Memory Type     | Location                | Speed     | Scope       | Lifetime             | Size
Registers       | On-chip (inside SM)     | Fastest   | Per thread  | Thread duration      | Very small (~64KB per SM)
Shared Memory   | On-chip (inside SM)     | Very fast | Per block   | Block duration       | Small (~48-96KB per SM)
Local Memory    | Off-chip (DRAM)         | Slow      | Per thread  | Thread duration      | Large
Global Memory   | Off-chip (DRAM)         | Slow      | Entire grid | Application duration | Very large (GB)
Constant Memory | Off-chip (DRAM, cached) | Medium    | Entire grid | Application duration | 64KB

Memory Explained Simply

1. Global Memory (What we use in this project)

  • Largest and slowest
  • Accessible by all threads
  • Where you allocate arrays with cuda.device_array() or cuda.to_device()
  • Must explicitly copy data between host (CPU) and device (GPU)

2. Shared Memory

  • Fast, small, shared within a block
  • Threads in the same block can share data quickly
  • Manually managed (advanced topic)
  • Great for optimization

3. Registers

  • Fastest memory
  • Automatic—compiler uses them for local variables
  • Limited per thread

4. Constant Memory

  • Read-only for kernels
  • Good for values that don't change
  • Cached for fast access

Memory Transfer Pattern (This Project):

CPU Memory (Host)                  GPU Memory (Device)
    a, b arrays
        ↓
  cuda.to_device() ──────────────→  a_device, b_device
                                          ↓
                                    GPU Kernel Executes
                                    (uses Global Memory)
                                          ↓
                                      c_device
        ↓
  c = c_device.copy_to_host() ←──────────
        ↓
    c array

Important: Data transfer between CPU and GPU is slow. For real applications:

  • Minimize transfers
  • Keep data on GPU as long as possible
  • Process multiple operations on GPU before copying results back
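
As a hedged sketch of that advice, the snippet below runs two kernels back to back while the data stays on the GPU, and copies the result to the host only once (the kernels are made up for illustration):

import numpy as np
from numba import cuda

@cuda.jit
def add_one(x, out):
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.size:
        out[i] = x[i] + 1.0

@cuda.jit
def double(x, out):
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.size:
        out[i] = 2.0 * x[i]

a = np.ones(1_000_000, dtype=np.float32)
a_dev = cuda.to_device(a)                    # one host -> device transfer
tmp_dev = cuda.device_array_like(a_dev)
out_dev = cuda.device_array_like(a_dev)

threads = 256
blocks = (a.size + threads - 1) // threads
add_one[blocks, threads](a_dev, tmp_dev)     # intermediate result stays in GPU memory
double[blocks, threads](tmp_dev, out_dev)    # second kernel reads it without a round trip
result = out_dev.copy_to_host()              # one device -> host transfer at the end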

🚀 Why This Project Matters

Vector addition is the "Hello World" of GPU programming!

Just like printing "Hello World" is the first step in learning a programming language, vector addition is the first step in learning GPU programming. It's simple enough to understand but demonstrates the core concepts:

  • Writing GPU kernels
  • Managing GPU memory
  • Launching parallel threads
  • Understanding speedup from parallelization

Once you understand vector addition on the GPU, you can apply these concepts to more complex problems like:

  • Matrix multiplication
  • Image processing
  • Deep learning
  • Scientific computing

📁 Project Structure

python-numba-vector-addition/
├── cpu/
│   └── vector_add_cpu.py      # CPU implementation (sequential)
├── cuda/
│   └── vector_add_gpu.py      # GPU implementation (parallel)
└── README.md                  # This file

What Each File Does

cpu/vector_add_cpu.py

  • Language: Standard Python
  • Hardware: Runs on CPU only
  • Method: Uses a simple for-loop to add vectors element by element
  • Speed: Slower for large datasets (sequential processing)
  • Purpose: Demonstrates traditional CPU approach

Key Code:

for i in range(n):
    c[i] = a[i] + b[i]  # One addition at a time

cuda/vector_add_gpu.py

  • Language: Python + Numba CUDA
  • Hardware: Runs on NVIDIA GPU
  • Method: Uses GPU kernel with thousands of parallel threads
  • Speed: Much faster for large datasets (parallel processing)
  • Purpose: Demonstrates modern GPU approach

Key Code:

@cuda.jit
def vector_add_kernel(a, b, c):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < c.size:
        c[idx] = a[idx] + b[idx]  # Many additions at the same time!

🛠️ Requirements

Hardware Requirements

  • NVIDIA GPU (GeForce, Quadro, Tesla, RTX, etc.)
    • Any CUDA-capable NVIDIA GPU will work
    • To check if you have one: Run nvidia-smi in terminal

Software Requirements

  • Python 3.7+
  • CUDA Toolkit (usually installed with NVIDIA drivers)
  • Numba package

Installing Numba

Install Numba using pip:

pip install numba

This will automatically install NumPy as well, which is required for array operations.
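
As a quick sanity check that the install worked (a two-line snippet, not part of this project's scripts):

from numba import cuda
print(cuda.is_available())   # True if a CUDA-capable GPU and driver were found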


💻 How to Run

Running the CPU Version

The CPU version uses standard Python and will run on any computer:

python cpu/vector_add_cpu.py

Expected Output:

============================================================
CPU VECTOR ADDITION - Standard Python
============================================================

Creating two vectors with 10,000,000 elements each...
Vector A: first 10 elements = [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Vector B: first 10 elements = [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

Performing vector addition on CPU...
Result C: first 10 elements = [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
Expected: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]

✓ CPU computation completed in X.XXXX seconds

Running the GPU Version

The GPU version requires an NVIDIA GPU and Numba CUDA:

python cuda/vector_add_gpu.py

Expected Output:

✓ CUDA is available!
✓ Detected GPU: NVIDIA GeForce RTX XXXX

============================================================
GPU VECTOR ADDITION - Python + Numba CUDA
============================================================

Creating two vectors with 10,000,000 elements each...
Vector A: first 10 elements = [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Vector B: first 10 elements = [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

Step 1: Copying data from CPU to GPU...
✓ Data copied to GPU memory

Step 2: Configuring GPU execution...
  - Threads per block: 256
  - Blocks in grid: 39,063
  - Total threads: 10,000,128

Step 3: Executing vector addition on GPU...
✓ GPU computation completed in X.XXXX seconds

Step 4: Copying result from GPU back to CPU...
✓ Result copied back to CPU memory

RESULTS:
Result C: first 10 elements = [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
Expected: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]

✓ Verification PASSED: GPU result matches expected values!

📖 Understanding the Code

CPU Version – Key Concepts

import numpy as np

def vector_add_cpu(a, b):
    n = len(a)
    c = np.zeros(n)

    # Sequential loop - one addition at a time
    for i in range(n):
        c[i] = a[i] + b[i]

    return c

What happens:

  1. Create empty result array
  2. Loop through each index (0, 1, 2, ...)
  3. Add corresponding elements one by one
  4. Each addition happens after the previous one finishes

Performance: Slow for large arrays because operations are sequential.
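
To see the sequential cost yourself, a small timing sketch around the function above (the array size is just an example):

import time
import numpy as np

n = 10_000_000
a = np.ones(n)
b = np.ones(n)

start = time.perf_counter()
c = vector_add_cpu(a, b)          # the function shown above
print(f"CPU loop took {time.perf_counter() - start:.4f} s")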


GPU Version – Key Concepts

The Kernel Function

@cuda.jit  # Tells Numba: compile this for GPU!
def vector_add_kernel(a, b, c):
    # Calculate this thread's unique global index
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    
    # Bounds check (important when thread count > array size)
    if idx < c.size:
        # Each thread computes ONE element
        c[idx] = a[idx] + b[idx]

What happens:

  1. GPU launches thousands of threads simultaneously
  2. Each thread calculates its unique global index (idx)
  3. Thread with idx=0 computes c[0] = a[0] + b[0]
  4. Thread with idx=1 computes c[1] = a[1] + b[1]
  5. Thread with idx=2 computes c[2] = a[2] + b[2]
  6. ... all at the same time (parallel execution)!
  7. Threads with idx ≥ array size do nothing (bounds check)

Performance: Much faster because thousands of additions happen in parallel.

Launching the Kernel

# Configuration
threads_per_block = 256  # Each block has 256 threads
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block

# Launch the kernel
vector_add_kernel[blocks_per_grid, threads_per_block](a_device, b_device, c_device)

The Launch Configuration: [blocks, threads]

This tells CUDA:

  • How many blocks to create
  • How many threads per block

Example Calculation:

  • Array size: 10,000,000 elements
  • Threads per block: 256
  • Blocks needed: 10,000,000 ÷ 256 = 39,063 blocks (rounded up)
  • Total threads launched: 39,063 × 256 = 10,000,128 threads
  • Extra threads: 10,000,128 - 10,000,000 = 128 (these do nothing thanks to bounds check)

Why the bounds check if idx < c.size:?

Because we often launch more threads than needed! We round up to the nearest block size. Without the bounds check, extra threads would access invalid memory (crash!).
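
Putting these pieces together, a minimal end-to-end sketch (consistent with, but not copied from, cuda/vector_add_gpu.py) might look like this:

import numpy as np
from numba import cuda

@cuda.jit
def vector_add_kernel(a, b, c):
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if idx < c.size:
        c[idx] = a[idx] + b[idx]

n = 10_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)

a_dev = cuda.to_device(a)                 # host -> device
b_dev = cuda.to_device(b)
c_dev = cuda.device_array(n, dtype=np.float32)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
vector_add_kernel[blocks_per_grid, threads_per_block](a_dev, b_dev, c_dev)
cuda.synchronize()                        # wait for the GPU to finish

c = c_dev.copy_to_host()                  # device -> host
assert np.allclose(c, 2.0)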


CUDA Keywords and Functions Explained

Decorators and Kernel Definition

Keyword   | Explanation                                            | Example
@cuda.jit | Decorator that compiles the function to run on the GPU | @cuda.jit placed above def my_kernel(...):

Thread Identification (Built-in Variables)

Variable         | Type | Explanation                           | Range
cuda.threadIdx.x | int  | Thread ID within its block (X-axis)   | 0 to (blockDim.x - 1)
cuda.threadIdx.y | int  | Thread ID (Y-axis) for 2D/3D blocks   | 0 to (blockDim.y - 1)
cuda.threadIdx.z | int  | Thread ID (Z-axis) for 3D blocks      | 0 to (blockDim.z - 1)
cuda.blockIdx.x  | int  | Block ID within the grid (X-axis)     | 0 to (gridDim.x - 1)
cuda.blockIdx.y  | int  | Block ID (Y-axis) for 2D/3D grids     | 0 to (gridDim.y - 1)
cuda.blockIdx.z  | int  | Block ID (Z-axis) for 3D grids        | 0 to (gridDim.z - 1)
cuda.blockDim.x  | int  | Number of threads per block (X-axis)  | Set by you at launch
cuda.blockDim.y  | int  | Number of threads per block (Y-axis)  | Set by you at launch
cuda.blockDim.z  | int  | Number of threads per block (Z-axis)  | Set by you at launch
cuda.gridDim.x   | int  | Number of blocks in the grid (X-axis) | Set by you at launch (blocks_per_grid)

Computing Global Thread ID:

# 1D case (our project):
idx = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x

# 2D case (for images):
row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
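
For context, the 2D form is used inside kernels launched with 2D blocks and grids. A hedged sketch (matrix addition, not part of this project):

import numpy as np
from numba import cuda

@cuda.jit
def matrix_add(a, b, out):
    row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if row < out.shape[0] and col < out.shape[1]:
        out[row, col] = a[row, col] + b[row, col]

a = np.ones((512, 512), dtype=np.float32)
b = np.ones((512, 512), dtype=np.float32)
out_dev = cuda.device_array_like(a)

threads = (16, 16)                                           # 16 x 16 threads per block
blocks = ((a.shape[1] + 15) // 16, (a.shape[0] + 15) // 16)  # blocks along (x, y)
matrix_add[blocks, threads](cuda.to_device(a), cuda.to_device(b), out_dev)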

Memory Management Functions

Function                 | Purpose                               | Returns     | Example
cuda.to_device(array)    | Copy an array from CPU to GPU         | GPU array   | a_gpu = cuda.to_device(a_cpu)
cuda.device_array(shape) | Allocate an empty array on the GPU    | GPU array   | c_gpu = cuda.device_array(1000)
gpu_array.copy_to_host() | Copy a GPU array back to the CPU      | NumPy array | result = c_gpu.copy_to_host()
cuda.synchronize()       | Wait for all GPU operations to finish | None        | cuda.synchronize()

GPU Information Functions

Function                  | Purpose                            | Example
cuda.is_available()       | Check if a CUDA GPU is available   | if cuda.is_available():
cuda.get_current_device() | Get the current GPU device object  | device = cuda.get_current_device()
device.name               | Get the GPU name                   | print(device.name.decode())

What is a Kernel Function?

A kernel is a special function that:

  • Is defined with @cuda.jit decorator (or __global__ in C++ CUDA)
  • Runs on the GPU (device)
  • Is called/launched from CPU code (host)
  • Cannot explicitly return values (must write results to arrays)
  • Runs asynchronously (CPU doesn't wait for it to finish by default)

Key Characteristics:

Characteristic | Description
Execution      | Runs on the GPU, launched from CPU code
Returns        | Cannot return values; must write results to output arrays
Declaration    | Blocks and threads must be specified when launching
Asynchronous   | The CPU continues immediately after launch (unless you synchronize)

Launching a Kernel (Invocation):

# Define the configuration
my_kernel[blocks_per_grid, threads_per_block](arg1, arg2, arg3)
#         ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^
#         Grid config      Block config      Function arguments

Asynchronous Execution:

# CPU launches kernel
my_kernel[100, 256](a, b, c)  # CPU doesn't wait!

# CPU continues immediately
print("Kernel launched!")  # This runs while GPU is still working

# Explicitly wait for GPU to finish
cuda.synchronize()  # Now CPU waits for GPU

print("GPU finished!")  # This runs after GPU completes

Memory Transfer Flow

CPU (Host) Memory          GPU (Device) Memory
      
      A, B  ─────────────────────>  A, B          (cuda.to_device)
                                       │
                                       │ GPU Kernel
                                       │ Computes C
                                       ↓
                                       C
                                       
      C     <─────────────────────  C             (.copy_to_host)

Important: Moving data between CPU and GPU takes time! For real applications, you want to minimize these transfers and keep data on the GPU as much as possible.

Transfer Bottleneck Example:

  • GPU computation: 0.001 seconds ⚡ (very fast)
  • CPU→GPU transfer: 0.010 seconds 🐌 (10x slower!)
  • GPU→CPU transfer: 0.010 seconds 🐌
  • Total time: 0.021 seconds (transfer dominates!)

Optimization Strategy:

  • Batch multiple operations on GPU before transferring back
  • Reuse GPU memory across multiple kernel calls
  • Overlap computation with transfers (advanced)

⚡ Performance Comparison

For a vector with 10 million elements:

Implementation    | Hardware | Time (typical)    | Speedup
CPU (Python loop) | CPU      | ~1-3 seconds      | 1× (baseline)
GPU (Numba CUDA)  | GPU      | ~0.01-0.1 seconds | 10-100× faster

Note: Actual speedup depends on:

  • GPU model (newer = faster)
  • Vector size (larger = better GPU advantage)
  • Data transfer overhead
  • CPU model

For very small vectors (< 10,000 elements), the CPU might be faster because the overhead of copying data to/from GPU dominates the computation time.


🎓 Learning Path

After understanding this project, you can explore:

  1. Matrix Multiplication – 2D arrays and more complex thread indexing
  2. Image Processing – Apply filters to images using GPU
  3. Reduction Operations – Sum, max, min across large arrays
  4. Shared Memory – Advanced technique for faster GPU computation
  5. CuPy – NumPy-like library that runs entirely on GPU
  6. PyTorch/TensorFlow – Deep learning frameworks that use GPUs

📝 Important Notes

CPU Version

  • ✅ Uses standard Python – runs on any computer
  • ✅ No special hardware required
  • ✅ Easy to understand and modify
  • ⚠️ Slower for large datasets (sequential processing)

GPU Version

  • ✅ Uses Numba CUDA – runs on NVIDIA GPUs
  • ✅ Much faster for large datasets (parallel processing)
  • ⚠️ Requires NVIDIA GPU with CUDA support
  • ⚠️ Requires Numba package (pip install numba)
  • ⚠️ Data transfer between CPU/GPU adds overhead

🐛 Troubleshooting

"CUDA is not available"

Problem: GPU code won't run

Solutions:

  1. Check if you have an NVIDIA GPU: nvidia-smi
  2. Install/update NVIDIA drivers
  3. Install CUDA Toolkit
  4. Install Numba: pip install numba
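
You can also check from Python itself. Assuming Numba is installed, this small snippet reports whether a GPU is visible (cuda.detect() prints the devices Numba can see):

from numba import cuda
print(cuda.is_available())   # False means no usable GPU/driver was detected
cuda.detect()                # lists detected CUDA devices, if any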

"No module named 'numba'"

Problem: Numba is not installed

Solution:

pip install numba

Very slow GPU performance

Possible reasons:

  1. Vector size is too small (GPU overhead dominates)
  2. Old/low-end GPU
  3. Using integrated graphics instead of dedicated GPU

🌟 Why This is Portfolio-Worthy

This project demonstrates:

  • Understanding of parallel computing – you know the difference between CPU and GPU execution
  • Modern Python skills – using advanced libraries like Numba
  • Performance optimization – you know when and how to use GPUs
  • Clear documentation – a professional README and comments
  • Practical application – a real-world performance comparison

Perfect for:

  • GitHub portfolio showcasing GPU programming skills
  • Learning foundation for machine learning (GPUs power modern AI)
  • Understanding high-performance computing concepts
  • Interview talking point for data science/ML positions


👨‍💻 Target Audience

This project is designed for:

  • Absolute beginners in GPU programming
  • Python developers wanting to learn CUDA
  • Students learning parallel computing
  • Data scientists exploring GPU acceleration
  • Anyone building a GitHub portfolio

Prerequisites:

  • Basic Python knowledge (functions, loops, arrays)
  • Understanding of NumPy arrays (helpful but not required)
  • No prior CUDA or GPU programming experience needed!

📄 License

This project is open source and available for educational purposes.


🙋 Questions?

Common questions beginners ask:

Q: Do I need a powerful GPU?
A: No! Any CUDA-capable NVIDIA GPU will work, even older models.

Q: Will this work on AMD GPUs?
A: No, Numba CUDA only works with NVIDIA GPUs. For AMD, look into ROCm.

Q: Can I run this on Google Colab?
A: Yes! Colab provides free NVIDIA Tesla GPUs. Perfect for learning!

Q: Is this faster than NumPy?
A: NumPy's vectorized addition is already fast on the CPU. The GPU shines for custom operations or very large arrays, where the parallel speedup outweighs the transfer overhead.

Q: What's next after this project?
A: Try matrix multiplication, image processing, or explore deep learning frameworks!


Happy GPU Programming! 🚀
