A high-performance implementation of Sparse Matrix-Vector Multiplication (SpMV) in C++ with serial, parallel (OpenMP), and GPU-accelerated (CUDA) versions, demonstrating the performance benefits of parallelism across different architectures.
This project implements SpMV using the Compressed Sparse Row (CSR) format, which is memory-efficient and optimized for sparse matrix operations. The implementation includes:
- Serial Implementation: Sequential SpMV algorithm (CPU baseline)
- OpenMP Parallel Implementation: CPU multi-core parallel SpMV with row-wise parallelization
- CUDA Parallel Implementation: GPU-accelerated SpMV using NVIDIA CUDA (NEW!)
- Validation Suite: Comprehensive testing against dense matrix multiplication
- Performance Benchmarking: Detailed performance analysis with speedup and efficiency metrics
- Visualization: Python scripts to generate performance graphs
.
├── SparseMatrix.h        # Sparse matrix class header (CSR format)
├── SparseMatrix.cpp      # Sparse matrix implementation
├── serial_spmv.h         # Serial SpMV header
├── serial_spmv.cpp       # Serial SpMV implementation
├── parallel_spmv.h       # Parallel SpMV header (OpenMP)
├── parallel_spmv.cpp     # Parallel SpMV implementation (OpenMP)
├── cuda_spmv.cuh         # CUDA SpMV header (GPU)
├── cuda_spmv.cu          # CUDA SpMV implementation (GPU)
├── validate.cpp          # Validation test program (Serial + OpenMP)
├── validate_cuda.cpp     # Validation with CUDA tests
├── benchmark.cpp         # Performance benchmarking (Serial + OpenMP)
├── benchmark_cuda.cpp    # Performance benchmarking (with CUDA)
├── plot_results.py       # Visualization script (Serial + OpenMP)
├── plot_cuda_results.py  # Visualization script (with CUDA)
├── Makefile              # Build automation (Serial + OpenMP)
├── Makefile.cuda         # Build automation (with CUDA)
├── CUDA_GUIDE.md         # CUDA implementation guide
└── README.md             # This file
- C++ Compiler: GCC 7.0+ or Clang 6.0+ with C++17 support
- OpenMP: For parallel implementation (usually included with GCC)
- Make: For build automation
- NVIDIA GPU: CUDA Compute Capability 6.0 or higher
- CUDA Toolkit: Version 11.0 or higher
- NVIDIA Drivers: Latest recommended
- See CUDA_GUIDE.md for detailed setup instructions
- Python 3.6+
- Libraries: matplotlib, pandas, numpy
Install Python dependencies:
pip3 install matplotlib pandas numpy

make all

This compiles both the validation and benchmark programs.
make test
# or
./validate

This validates the correctness of both the serial and parallel implementations.
make bench
# or
./benchmark

This runs performance tests on multiple matrix sizes and generates benchmark_results.csv.
make plot
# or
python3 plot_results.py

This creates graphs in the results/ directory:

- execution_time.png - Execution time comparison
- speedup.png - Speedup vs. problem size
- efficiency.png - Parallel efficiency
- execution_time_log.png - Log-scale comparison
make run

This builds, tests, benchmarks, and generates plots in one command.
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed
# Build CUDA programs
make -f Makefile.cuda all
# Validate correctness
make -f Makefile.cuda test
# Run benchmarks
make -f Makefile.cuda bench
# Generate graphs
make -f Makefile.cuda plot
# Complete workflow
make -f Makefile.cuda run

Generated in the results_cuda/ directory:

- execution_time_cuda.png - Serial vs. OpenMP vs. CUDA comparison
- speedup_comparison.png - OpenMP and CUDA speedups
- performance_bars.png - Bar chart comparison
- execution_time_log.png - Log-scale performance
- cuda_vs_openmp.png - CUDA vs. OpenMP relative performance
For detailed CUDA instructions, see CUDA_GUIDE.md
The Compressed Sparse Row (CSR) format stores only the non-zero elements (a small worked example follows this list):
- values: Array of non-zero values
- col_indices: Column index for each value
- row_ptr: Starting index of each row in values array
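As a concrete illustration (this example is not taken from SparseMatrix.h; the array names simply mirror the fields listed above), a small 4×4 matrix with five non-zeros would be stored as:

```cpp
#include <vector>

int main() {
    // Illustrative 4x4 matrix (not from the project sources):
    //   [10  0  0  2]
    //   [ 0  3  0  0]
    //   [ 0  0  0  0]
    //   [ 1  0  4  0]
    std::vector<double> values      = {10.0, 2.0, 3.0, 1.0, 4.0}; // non-zeros, row by row
    std::vector<int>    col_indices = {0, 3, 1, 0, 2};            // column of each value
    std::vector<int>    row_ptr     = {0, 2, 3, 3, 5};            // row i occupies [row_ptr[i], row_ptr[i+1])
    // Row 2 is empty, so row_ptr[2] == row_ptr[3]; row_ptr.back() equals the non-zero count.
    return 0;
}
```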
for (int i = 0; i < num_rows; ++i) {
    result[i] = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        result[i] += values[j] * vector[col_indices[j]];
}
#pragma omp parallel for
for (int i = 0; i < num_rows; ++i) {
    result[i] = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        result[i] += values[j] * vector[col_indices[j]];
}
Key advantages of row-wise parallelization:
- No race conditions (each thread writes to different result elements)
- Good load balancing with static scheduling (see the sketch after this list)
- Cache-friendly access pattern
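As a sketch of how an explicit thread count and static schedule could be combined (the clauses are standard OpenMP; the function name and whether parallel_spmv.cpp spells the clauses out this way are assumptions):

```cpp
#include <omp.h>

// Sketch only: raw CSR arrays and vectors are passed in; num_threads comes from the caller.
void spmv_openmp_sketch(int num_rows, const int* row_ptr, const int* col_indices,
                        const double* values, const double* x, double* y, int num_threads) {
    // schedule(static) gives each thread one contiguous, equal-sized block of rows,
    // so no two threads ever write the same y[i].
    #pragma omp parallel for num_threads(num_threads) schedule(static)
    for (int i = 0; i < num_rows; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += values[j] * x[col_indices[j]];
        y[i] = sum;
    }
}
```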
__global__ void spmv_kernel(int num_rows, const int* row_ptr, const int* col_indices,
                            const double* values, const double* vector, double* result) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += values[j] * vector[col_indices[j]];
        result[row] = sum;
    }
}

Key advantages of GPU parallelization:
- Massive parallelism (thousands of threads; a host-side launch sketch follows this list)
- Optimized for compute-intensive workloads
- Shared memory caching for small vectors
- Excellent for large-scale problems
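A host-side sketch of how a kernel like the one above gets its data and launch configuration (the function name, h_/d_ prefixes, and block size are illustrative; the project's actual wrapper lives in cuda_spmv.cu and may differ):

```cuda
#include <cuda_runtime.h>

// Sketch: copy the CSR arrays and dense vector to the GPU, launch one thread per
// row, and copy the result back. Error checking is omitted; a square matrix is
// assumed, so the input vector has num_rows entries.
void spmv_cuda_sketch(int num_rows, int nnz,
                      const int* h_row_ptr, const int* h_col_indices,
                      const double* h_values, const double* h_x, double* h_y) {
    int *d_row_ptr, *d_col_indices;
    double *d_values, *d_x, *d_y;
    cudaMalloc(&d_row_ptr, (num_rows + 1) * sizeof(int));
    cudaMalloc(&d_col_indices, nnz * sizeof(int));
    cudaMalloc(&d_values, nnz * sizeof(double));
    cudaMalloc(&d_x, num_rows * sizeof(double));
    cudaMalloc(&d_y, num_rows * sizeof(double));

    cudaMemcpy(d_row_ptr, h_row_ptr, (num_rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col_indices, h_col_indices, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_values, h_values, nnz * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, num_rows * sizeof(double), cudaMemcpyHostToDevice);

    int block = 256;                             // threads per block
    int grid  = (num_rows + block - 1) / block;  // enough blocks to cover every row
    spmv_kernel<<<grid, block>>>(num_rows, d_row_ptr, d_col_indices, d_values, d_x, d_y);

    cudaMemcpy(h_y, d_y, num_rows * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_row_ptr); cudaFree(d_col_indices); cudaFree(d_values);
    cudaFree(d_x); cudaFree(d_y);
}
```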
The validation suite tests both implementations on:
- Small predefined matrices (4×4)
- Random sparse matrices (100×100, 10% density)
- Very sparse matrices (200×200, 2% density)
- Diagonal matrices (5×5)
All results are compared against a ground-truth dense matrix-vector multiplication with a tolerance of 1e-10.
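The comparison itself amounts to an element-wise check; a minimal sketch of that idea (validate.cpp's actual helper names and structure are not shown here, so this is illustrative):

```cpp
#include <cmath>
#include <vector>

// Returns true if every element of the sparse result matches the dense
// ground-truth result to within the given absolute tolerance.
bool results_match(const std::vector<double>& sparse_result,
                   const std::vector<double>& dense_result,
                   double tolerance = 1e-10) {
    if (sparse_result.size() != dense_result.size()) return false;
    for (std::size_t i = 0; i < sparse_result.size(); ++i)
        if (std::fabs(sparse_result[i] - dense_result[i]) > tolerance) return false;
    return true;
}
```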
- Matrix Sizes: 100, 200, 500, 1000, 2000, 5000, 10000
- Density: 10% (configurable)
- Warmup Runs: 2
- Timing Runs: 5 (averaged)
- Execution Time: Time to complete SpMV operation
- Speedup: S = T_serial / T_parallel (computed as in the sketch after this list)
- Efficiency: E = Speedup / num_threads
- Scalability: Performance across problem sizes
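Putting the methodology and metrics together, a sketch of the measurement loop (benchmark.cpp's real code is not reproduced here; the callable passed in stands for whichever implementation is being timed):

```cpp
#include <chrono>
#include <functional>

// Sketch: time one SpMV variant with warmup runs followed by averaged timing runs.
// The callable `spmv` stands in for the serial, OpenMP, or CUDA implementation.
double time_spmv_seconds(const std::function<void()>& spmv,
                         int warmup_runs = 2, int timing_runs = 5) {
    for (int r = 0; r < warmup_runs; ++r)
        spmv();                                   // warm caches; result discarded
    double total = 0.0;
    for (int r = 0; r < timing_runs; ++r) {
        auto start = std::chrono::high_resolution_clock::now();
        spmv();
        auto end = std::chrono::high_resolution_clock::now();
        total += std::chrono::duration<double>(end - start).count();
    }
    return total / timing_runs;                   // averaged execution time in seconds
}

// From the averaged times: speedup = t_serial / t_parallel,
// efficiency = speedup / num_threads.
```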
make all # Build all programs
make validate # Build validation program
make benchmark # Build benchmark program
make test # Run validation tests
make bench # Run benchmarks
make plot # Generate graphs
make run # Full workflow (build + test + bench + plot)
make clean # Remove build artifacts
make rebuild # Clean and rebuild

- Row-wise parallelization ensures no race conditions
- Each thread computes independent rows
- No synchronization needed within main computation loop
- CSR Format: Minimizes memory usage and cache misses
- Static Scheduling: Balanced workload distribution
- Compiler Optimization: -O3 flag for maximum performance
- Loop Vectorization: Enabled by modern compilers
- Performance scales well with matrix size
- Efficiency depends on number of threads and problem size
- Small matrices may show overhead from parallelization
- Large matrices demonstrate significant speedup
Modify the benchmark configuration in benchmark.cpp:
std::vector<int> sizes = {100, 500, 1000, 5000, 10000};
double density = 0.05;  // 5% density

Set the OpenMP environment variable:
export OMP_NUM_THREADS=4
./benchmark

Or modify it in code:
int num_threads = 4;
spmv_parallel(matrix, vector, num_threads);

If you get OpenMP errors:
# Ubuntu/Debian
sudo apt-get install libomp-dev
# macOS (using Homebrew)
brew install libomp

If matplotlib is missing:
pip3 install --user matplotlib pandas numpy

- CSR Format: Saad, Y. (2003). Iterative Methods for Sparse Linear Systems.
- OpenMP: OpenMP Architecture Review Board. OpenMP Application Programming Interface.
- Performance Analysis: Foster, I. (1995). Designing and Building Parallel Programs.