Sparse Matrix-Vector Multiplication (SpMV)

A high-performance implementation of Sparse Matrix-Vector Multiplication in C++ with serial, parallel (OpenMP), and GPU-accelerated (CUDA) versions, demonstrating the performance benefits of parallelism across different architectures.

πŸ“‹ Project Overview

This project implements SpMV using the Compressed Sparse Row (CSR) format, which is memory-efficient and optimized for sparse matrix operations. The implementation includes:

  • Serial Implementation: Sequential SpMV algorithm (CPU baseline)
  • OpenMP Parallel Implementation: CPU multi-core parallel SpMV with row-wise parallelization
  • CUDA Parallel Implementation: GPU-accelerated SpMV using NVIDIA CUDA (NEW! ⚑)
  • Validation Suite: Comprehensive testing against dense matrix multiplication
  • Performance Benchmarking: Detailed performance analysis with speedup and efficiency metrics
  • Visualization: Python scripts to generate performance graphs

πŸ—οΈ Project Structure

.
β”œβ”€β”€ SparseMatrix.h          # Sparse matrix class header (CSR format)
β”œβ”€β”€ SparseMatrix.cpp        # Sparse matrix implementation
β”œβ”€β”€ serial_spmv.h           # Serial SpMV header
β”œβ”€β”€ serial_spmv.cpp         # Serial SpMV implementation
β”œβ”€β”€ parallel_spmv.h         # Parallel SpMV header (OpenMP)
β”œβ”€β”€ parallel_spmv.cpp       # Parallel SpMV implementation (OpenMP)
β”œβ”€β”€ cuda_spmv.cuh           # CUDA SpMV header (GPU)
β”œβ”€β”€ cuda_spmv.cu            # CUDA SpMV implementation (GPU)
β”œβ”€β”€ validate.cpp            # Validation test program (Serial + OpenMP)
β”œβ”€β”€ validate_cuda.cpp       # Validation with CUDA tests
β”œβ”€β”€ benchmark.cpp           # Performance benchmarking (Serial + OpenMP)
β”œβ”€β”€ benchmark_cuda.cpp      # Performance benchmarking (with CUDA)
β”œβ”€β”€ plot_results.py         # Visualization script (Serial + OpenMP)
β”œβ”€β”€ plot_cuda_results.py    # Visualization script (with CUDA)
β”œβ”€β”€ Makefile                # Build automation (Serial + OpenMP)
β”œβ”€β”€ Makefile.cuda           # Build automation (with CUDA)
β”œβ”€β”€ CUDA_GUIDE.md           # CUDA implementation guide
└── README.md               # This file

πŸ”§ Requirements

CPU-Only Build Requirements

  • C++ Compiler: GCC 7.0+ or Clang 6.0+ with C++17 support
  • OpenMP: For the parallel CPU implementation (usually bundled with GCC)
  • Make: For build automation

CUDA Build Requirements (Optional - for GPU acceleration)

  • NVIDIA GPU: CUDA Compute Capability 6.0 or higher
  • CUDA Toolkit: Version 11.0 or higher
  • NVIDIA Drivers: Latest recommended
  • See CUDA_GUIDE.md for detailed setup instructions

Visualization Requirements (Optional)

  • Python 3.6+
  • Libraries: matplotlib, pandas, numpy

Install Python dependencies:

pip3 install matplotlib pandas numpy

πŸš€ Quick Start (CPU Version)

1. Build the Project

make all

This compiles both the validation and benchmark programs.

2. Run Validation Tests

make test
# or
./validate

This validates the correctness of both serial and parallel implementations.

3. Run Performance Benchmarks

make bench
# or
./benchmark

This runs performance tests on multiple matrix sizes and generates benchmark_results.csv.

4. Generate Performance Graphs

make plot
# or
python3 plot_results.py

This creates graphs in the results/ directory:

  • execution_time.png - Execution time comparison
  • speedup.png - Speedup vs problem size
  • efficiency.png - Parallel efficiency
  • execution_time_log.png - Log-scale comparison

5. Complete Workflow

make run

This builds, tests, benchmarks, and generates plots in one command.

⚑ CUDA GPU Version (Optional)

Prerequisites

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit installed

Build and Run

# Build CUDA programs
make -f Makefile.cuda all

# Validate correctness
make -f Makefile.cuda test

# Run benchmarks
make -f Makefile.cuda bench

# Generate graphs
make -f Makefile.cuda plot

# Complete workflow
make -f Makefile.cuda run

CUDA Performance Graphs

Generated in results_cuda/ directory:

  • execution_time_cuda.png - Serial vs OpenMP vs CUDA comparison
  • speedup_comparison.png - OpenMP and CUDA speedups
  • performance_bars.png - Bar chart comparison
  • execution_time_log.png - Log-scale performance
  • cuda_vs_openmp.png - CUDA vs OpenMP relative performance

For detailed CUDA instructions, see CUDA_GUIDE.md

πŸ“Š Algorithm Design

CSR Format

The Compressed Sparse Row format stores only the non-zero elements, using three arrays (a worked example follows the list):

  • values: Array of non-zero values
  • col_indices: Column index for each value
  • row_ptr: Index in values where each row starts (length num_rows + 1, so row i spans row_ptr[i] to row_ptr[i+1])
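
As a concrete illustration, here is a small matrix and its CSR arrays (the array names match the descriptions above; the specific matrix is illustrative):

#include <vector>

// Example 4x4 matrix with 6 non-zeros:
//     | 5 0 0 1 |
// A = | 0 8 0 0 |
//     | 0 0 3 0 |
//     | 0 6 0 4 |
std::vector<double> values      = {5, 1, 8, 3, 6, 4};  // non-zero values, row by row
std::vector<int>    col_indices = {0, 3, 1, 2, 1, 3};  // column index of each value
std::vector<int>    row_ptr     = {0, 2, 3, 4, 6};     // row i spans [row_ptr[i], row_ptr[i+1])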

Serial Algorithm

// Serial CSR SpMV: result = A * vector, one row at a time
for (int i = 0; i < num_rows; ++i) {
    double sum = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        sum += values[j] * vector[col_indices[j]];
    result[i] = sum;
}

Parallel Algorithm (OpenMP)

// Row-wise parallel CSR SpMV: each thread owns a disjoint set of rows,
// so writes to result never conflict
#pragma omp parallel for schedule(static)
for (int i = 0; i < num_rows; ++i) {
    double sum = 0.0;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
        sum += values[j] * vector[col_indices[j]];
    result[i] = sum;
}

Key advantages of row-wise parallelization:

  • No race conditions (each thread writes to different result elements)
  • Good load balancing with static scheduling
  • Cache-friendly access pattern

CUDA Algorithm (GPU)

// One thread per row (scalar CSR kernel); the signature shown is illustrative
__global__ void spmv_kernel(int num_rows, const int* row_ptr,
                            const int* col_indices, const double* values,
                            const double* vector, double* result) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < num_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += values[j] * vector[col_indices[j]];
        result[row] = sum;
    }
}

Key advantages of GPU parallelization:

  • Massive parallelism (thousands of threads)
  • Optimized for compute-intensive workloads
  • Shared memory caching for small vectors
  • Excellent for large-scale problems
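
A minimal host-side launch for the kernel above might look like the following sketch. The device pointers (d_*) are assumed to already hold copies of the CSR arrays and the input vector (via cudaMalloc/cudaMemcpy); the project's actual wrapper lives in cuda_spmv.cu, so the names here are illustrative.

void launch_spmv(int num_rows, const int* d_row_ptr, const int* d_col_indices,
                 const double* d_values, const double* d_vector, double* d_result) {
    const int block_size = 256;                               // threads per block
    int grid_size = (num_rows + block_size - 1) / block_size; // ceil(num_rows / block_size)
    spmv_kernel<<<grid_size, block_size>>>(num_rows, d_row_ptr, d_col_indices,
                                           d_values, d_vector, d_result);
    cudaDeviceSynchronize();  // wait for completion before timing or copying back
}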

πŸ§ͺ Validation

The validation suite tests both implementations on:

  1. Small predefined matrices (4Γ—4)
  2. Random sparse matrices (100Γ—100, 10% density)
  3. Very sparse matrices (200Γ—200, 2% density)
  4. Diagonal matrices (5Γ—5)

All results are compared against a ground-truth dense matrix-vector multiplication with a tolerance of 1e-10.
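
A tolerance check of this kind can be sketched as follows (the helper name is illustrative; the real checks live in validate.cpp and validate_cuda.cpp):

#include <cmath>
#include <cstddef>
#include <vector>

// Compare an SpMV result against the dense ground-truth product,
// element by element, within the given absolute tolerance.
bool results_match(const std::vector<double>& spmv_result,
                   const std::vector<double>& dense_result,
                   double tolerance = 1e-10) {
    if (spmv_result.size() != dense_result.size()) return false;
    for (std::size_t i = 0; i < spmv_result.size(); ++i)
        if (std::fabs(spmv_result[i] - dense_result[i]) > tolerance)
            return false;
    return true;
}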

πŸ“ˆ Performance Evaluation

Benchmark Configuration

  • Matrix Sizes: 100, 200, 500, 1000, 2000, 5000, 10000
  • Density: 10% (configurable)
  • Warmup Runs: 2
  • Timing Runs: 5 (averaged)
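
A generic timing helper matching this configuration might look like the sketch below; benchmark.cpp contains the project's actual loop, so this is only illustrative. The callable op stands in for one SpMV invocation, e.g. [&]{ spmv_parallel(matrix, vec, num_threads); }.

#include <chrono>

// Average the runtime of `op` over several timed runs, after a few
// warmup runs to touch caches and page in data.
template <typename Op>
double time_avg_ms(Op op, int warmup_runs = 2, int timing_runs = 5) {
    for (int i = 0; i < warmup_runs; ++i)
        op();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timing_runs; ++i)
        op();
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count()
           / timing_runs;
}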

Metrics

  • Execution Time: Time to complete SpMV operation
  • Speedup: S = T_serial / T_parallel
  • Efficiency: E = Speedup / num_threads
  • Scalability: Performance across problem sizes
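
For example, if a serial run takes 80 ms and a 4-thread parallel run takes 25 ms, then S = 80 / 25 = 3.2 and E = 3.2 / 4 = 0.8, i.e. 80% parallel efficiency.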

πŸ› οΈ Makefile Targets

make all        # Build all programs
make validate   # Build validation program
make benchmark  # Build benchmark program
make test       # Run validation tests
make bench      # Run benchmarks
make plot       # Generate graphs
make run        # Full workflow (build + test + bench + plot)
make clean      # Remove build artifacts
make rebuild    # Clean and rebuild

πŸ” Implementation Details

Thread Safety

  • Row-wise parallelization ensures no race conditions
  • Each thread computes independent rows
  • No synchronization needed within main computation loop

Optimization Techniques

  1. CSR Format: Minimizes memory usage and cache misses
  2. Static Scheduling: Balanced workload distribution
  3. Compiler Optimization: -O3 flag for maximum performance
  4. Loop Vectorization: Enabled by modern compilers

Scalability Considerations

  • Performance scales well with matrix size
  • Efficiency depends on number of threads and problem size
  • Small matrices may show overhead from parallelization
  • Large matrices demonstrate significant speedup

πŸ“ Usage Examples

Running with Custom Parameters

Modify the benchmark configuration in benchmark.cpp:

std::vector<int> sizes = {100, 500, 1000, 5000, 10000};
double density = 0.05;  // 5% density

Controlling Thread Count

Set OpenMP environment variable:

export OMP_NUM_THREADS=4
./benchmark

Or set the thread count in code:

int num_threads = 4;
spmv_parallel(matrix, vector, num_threads);

πŸ› Troubleshooting

OpenMP Not Found

If you get OpenMP errors:

# Ubuntu/Debian
sudo apt-get install libomp-dev

# macOS (using Homebrew)
brew install libomp

Python Plotting Issues

If matplotlib is missing:

pip3 install --user matplotlib pandas numpy

πŸ“š References

  1. CSR Format: Saad, Y. (2003). Iterative Methods for Sparse Linear Systems
  2. OpenMP: OpenMP Architecture Review Board. OpenMP Application Programming Interface
  3. Performance Analysis: Foster, I. (1995). Designing and Building Parallel Programs
