A high-performance implementation of Sparse Matrix-Vector Multiplication (SpMV) in C++ with serial, parallel (OpenMP), and GPU-accelerated (CUDA) versions, demonstrating the performance benefits of parallelism across different architectures.
This project implements SpMV using the Compressed Sparse Row (CSR) format, which is memory-efficient and optimized for sparse matrix operations. The implementation includes:
- Serial Implementation: Sequential SpMV algorithm (CPU baseline)
- OpenMP Parallel Implementation: CPU multi-core parallel SpMV with row-wise parallelization
- CUDA Parallel Implementation: GPU-accelerated SpMV using NVIDIA CUDA (NEW!)
- Validation Suite: Comprehensive testing against dense matrix multiplication
- Performance Benchmarking: Detailed performance analysis with speedup and efficiency metrics
- Visualization: Python scripts to generate performance graphs
.
├── SparseMatrix.h        # Sparse matrix class header (CSR format)
├── SparseMatrix.cpp      # Sparse matrix implementation
├── serial_spmv.h         # Serial SpMV header
├── serial_spmv.cpp       # Serial SpMV implementation
├── parallel_spmv.h       # Parallel SpMV header (OpenMP)
├── parallel_spmv.cpp     # Parallel SpMV implementation (OpenMP)
├── cuda_spmv.cuh         # CUDA SpMV header (GPU)
├── cuda_spmv.cu          # CUDA SpMV implementation (GPU)
├── validate.cpp          # Validation test program (Serial + OpenMP)
├── validate_cuda.cpp     # Validation with CUDA tests
├── benchmark.cpp         # Performance benchmarking (Serial + OpenMP)
├── benchmark_cuda.cpp    # Performance benchmarking (with CUDA)
├── plot_results.py       # Visualization script (Serial + OpenMP)
├── plot_cuda_results.py  # Visualization script (with CUDA)
├── Makefile              # Build automation (Serial + OpenMP)
├── Makefile.cuda         # Build automation (with CUDA)
├── CUDA_GUIDE.md         # CUDA implementation guide
└── README.md             # This file
- C++ Compiler: GCC 7.0+ or Clang 6.0+ with C++17 support
- OpenMP: For parallel implementation (usually included with GCC)
- Make: For build automation
- NVIDIA GPU: CUDA Compute Capability 6.0 or higher
- CUDA Toolkit: Version 11.0 or higher
- NVIDIA Drivers: Latest recommended
- See CUDA_GUIDE.md for detailed setup instructions
- Python 3.6+
- Libraries: matplotlib, pandas, numpy
Install Python dependencies:
pip3 install matplotlib pandas numpy

make all

This compiles both the validation and benchmark programs.
make test
# or
./validate

This validates the correctness of both the serial and parallel implementations.
make bench
# or
./benchmark

This runs performance tests on multiple matrix sizes and generates benchmark_results.csv.
make plot
# or
python3 plot_results.py

This creates graphs in the results/ directory:

- execution_time.png - Execution time comparison
- speedup.png - Speedup vs. problem size
- efficiency.png - Parallel efficiency
- execution_time_log.png - Log-scale comparison
make run

This builds, tests, benchmarks, and generates plots in one command.
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed
# Build CUDA programs
make -f Makefile.cuda all
# Validate correctness
make -f Makefile.cuda test
# Run benchmarks
make -f Makefile.cuda bench
# Generate graphs
make -f Makefile.cuda plot
# Complete workflow
make -f Makefile.cuda run

Generated in the results_cuda/ directory:

- execution_time_cuda.png - Serial vs. OpenMP vs. CUDA comparison
- speedup_comparison.png - OpenMP and CUDA speedups
- performance_bars.png - Bar chart comparison
- execution_time_log.png - Log-scale performance
- cuda_vs_openmp.png - CUDA vs. OpenMP relative performance
For detailed CUDA instructions, see CUDA_GUIDE.md
The Compressed Sparse Row (CSR) format stores only the non-zero elements (a small worked example follows this list):
- values: Array of non-zero values
- col_indices: Column index for each value
- row_ptr: Starting index of each row in values array
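As a concrete illustration (this example is not taken from SparseMatrix.h; the array names simply mirror the fields listed above), a small 4×4 matrix with five non-zeros would be stored as:

```cpp
#include <vector>

int main() {
    // Illustrative 4x4 matrix (not from the project sources):
    //   [10  0  0  2]
    //   [ 0  3  0  0]
    //   [ 0  0  0  0]
    //   [ 1  0  4  0]
    std::vector<double> values      = {10.0, 2.0, 3.0, 1.0, 4.0}; // non-zeros, row by row
    std::vector<int>    col_indices = {0, 3, 1, 0, 2};            // column of each value
    std::vector<int>    row_ptr     = {0, 2, 3, 3, 5};            // row i occupies [row_ptr[i], row_ptr[i+1])
    // Row 2 is empty, so row_ptr[2] == row_ptr[3]; row_ptr.back() equals the non-zero count.
    return 0;
}
```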
for (int i = 0; i < num_rows; ++i) {
    result[i] = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        result[i] += values[j] * vector[col_indices[j]];
}
#pragma omp parallel for
for (int i = 0; i < num_rows; ++i) {
    result[i] = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        result[i] += values[j] * vector[col_indices[j]];
}
Key advantages of row-wise parallelization:
- No race conditions (each thread writes to different result elements)
- Good load balancing with static scheduling (see the sketch after this list)
- Cache-friendly access pattern
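As a sketch of how an explicit thread count and static schedule could be combined (the clauses are standard OpenMP; the function name and whether parallel_spmv.cpp spells the clauses out this way are assumptions):

```cpp
#include <omp.h>

// Sketch only: raw CSR arrays and vectors are passed in; num_threads comes from the caller.
void spmv_openmp_sketch(int num_rows, const int* row_ptr, const int* col_indices,
                        const double* values, const double* x, double* y, int num_threads) {
    // schedule(static) gives each thread one contiguous, equal-sized block of rows,
    // so no two threads ever write the same y[i].
    #pragma omp parallel for num_threads(num_threads) schedule(static)
    for (int i = 0; i < num_rows; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += values[j] * x[col_indices[j]];
        y[i] = sum;
    }
}
```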
__global__ void spmv_kernel(int num_rows, const int* row_ptr, const int* col_indices,
                            const double* values, const double* vector, double* result) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += values[j] * vector[col_indices[j]];
        result[row] = sum;
    }
}

Key advantages of GPU parallelization:
- Massive parallelism (thousands of threads; a host-side launch sketch follows this list)
- Optimized for compute-intensive workloads
- Shared memory caching for small vectors
- Excellent for large-scale problems
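A host-side sketch of how a kernel like the one above gets its data and launch configuration (the function name, h_/d_ prefixes, and block size are illustrative; the project's actual wrapper lives in cuda_spmv.cu and may differ):

```cuda
#include <cuda_runtime.h>

// Sketch: copy the CSR arrays and dense vector to the GPU, launch one thread per
// row, and copy the result back. Error checking is omitted; a square matrix is
// assumed, so the input vector has num_rows entries.
void spmv_cuda_sketch(int num_rows, int nnz,
                      const int* h_row_ptr, const int* h_col_indices,
                      const double* h_values, const double* h_x, double* h_y) {
    int *d_row_ptr, *d_col_indices;
    double *d_values, *d_x, *d_y;
    cudaMalloc(&d_row_ptr, (num_rows + 1) * sizeof(int));
    cudaMalloc(&d_col_indices, nnz * sizeof(int));
    cudaMalloc(&d_values, nnz * sizeof(double));
    cudaMalloc(&d_x, num_rows * sizeof(double));
    cudaMalloc(&d_y, num_rows * sizeof(double));

    cudaMemcpy(d_row_ptr, h_row_ptr, (num_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_indices, h_col_indices, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_values, h_values, nnz * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, num_rows * sizeof(double), cudaMemcpyHostToDevice);

    int block = 256;                             // threads per block
    int grid  = (num_rows + block - 1) / block;  // enough blocks to cover every row
    spmv_kernel<<<grid, block>>>(num_rows, d_row_ptr, d_col_indices, d_values, d_x, d_y);

    cudaMemcpy(h_y, d_y, num_rows * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_row_ptr); cudaFree(d_col_indices); cudaFree(d_values);
    cudaFree(d_x); cudaFree(d_y);
}
```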
The validation suite tests both implementations on:
- Small predefined matrices (4×4)
- Random sparse matrices (100×100, 10% density)
- Very sparse matrices (200×200, 2% density)
- Diagonal matrices (5×5)
All results are compared against a ground-truth dense matrix-vector multiplication with a tolerance of 1e-10.
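The comparison itself amounts to an element-wise check; a minimal sketch of that idea (validate.cpp's actual helper names and structure are not shown here, so this is illustrative):

```cpp
#include <cmath>
#include <vector>

// Returns true if every element of the sparse result matches the dense
// ground-truth result to within the given absolute tolerance.
bool results_match(const std::vector<double>& sparse_result,
                   const std::vector<double>& dense_result,
                   double tolerance = 1e-10) {
    if (sparse_result.size() != dense_result.size()) return false;
    for (std::size_t i = 0; i < sparse_result.size(); ++i)
        if (std::fabs(sparse_result[i] - dense_result[i]) > tolerance) return false;
    return true;
}
```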
- Matrix Sizes: 100, 200, 500, 1000, 2000, 5000, 10000
- Density: 10% (configurable)
- Warmup Runs: 2
- Timing Runs: 5 (averaged)
- Execution Time: Time to complete SpMV operation
- Speedup: S = T_serial / T_parallel (computed as in the sketch after this list)
- Efficiency: E = Speedup / num_threads
- Scalability: Performance across problem sizes
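Putting the methodology and metrics together, a sketch of the measurement loop (benchmark.cpp's real code is not reproduced here; the callable passed in stands for whichever implementation is being timed):

```cpp
#include <chrono>
#include <functional>

// Sketch: time one SpMV variant with warmup runs followed by averaged timing runs.
// The callable `spmv` stands in for the serial, OpenMP, or CUDA implementation.
double time_spmv_seconds(const std::function<void()>& spmv,
                         int warmup_runs = 2, int timing_runs = 5) {
    for (int r = 0; r < warmup_runs; ++r)
        spmv();                                   // warm caches; result discarded
    double total = 0.0;
    for (int r = 0; r < timing_runs; ++r) {
        auto start = std::chrono::high_resolution_clock::now();
        spmv();
        auto end = std::chrono::high_resolution_clock::now();
        total += std::chrono::duration<double>(end - start).count();
    }
    return total / timing_runs;                   // averaged execution time in seconds
}

// From the averaged times: speedup = t_serial / t_parallel,
// efficiency = speedup / num_threads.
```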
make all # Build all programs
make validate # Build validation program
make benchmark # Build benchmark program
make test # Run validation tests
make bench # Run benchmarks
make plot # Generate graphs
make run # Full workflow (build + test + bench + plot)
make clean # Remove build artifacts
make rebuild # Clean and rebuild

- Row-wise parallelization ensures no race conditions
- Each thread computes independent rows
- No synchronization needed within main computation loop
- CSR Format: Minimizes memory usage and cache misses
- Static Scheduling: Balanced workload distribution
- Compiler Optimization: -O3 flag for maximum performance
- Loop Vectorization: Enabled by modern compilers
- Performance scales well with matrix size
- Efficiency depends on number of threads and problem size
- Small matrices may show overhead from parallelization
- Large matrices demonstrate significant speedup
Modify the benchmark configuration in benchmark.cpp:
std::vector<int> sizes = {100, 500, 1000, 5000, 10000};
double density = 0.05;  // 5% density

Set the OpenMP environment variable:
export OMP_NUM_THREADS=4
./benchmark

Or modify it in code:
int num_threads = 4;
spmv_parallel(matrix, vector, num_threads);

If you get OpenMP errors:
# Ubuntu/Debian
sudo apt-get install libomp-dev
# macOS (using Homebrew)
brew install libomp

If matplotlib is missing:
pip3 install --user matplotlib pandas numpy

- CSR Format: Saad, Y. (2003). Iterative Methods for Sparse Linear Systems.
- OpenMP: OpenMP Architecture Review Board. OpenMP Application Programming Interface.
- Performance Analysis: Foster, I. (1995). Designing and Building Parallel Programs.