Self-Attention Kernels: High-Performance Computing

This project implements Causal Multi-Head Self-Attention (CMHSA) with three optimized backends, designed for high-performance deep learning inference:

  • Single-threaded CPU - Baseline implementation with SIMD optimizations
  • Multi-threaded CPU - OpenMP parallelization with tiled memory access patterns
  • CUDA GPU - Hand-optimized kernels achieving 1.09x speedup vs PyTorch naive on A100 GPUs

Performance Highlights

Benchmarks on Cineca (A100 GPUs) and Orfeo HPC clusters demonstrate significant speedups through iterative optimization:

  • CUDA v4.6: 1.09× faster than PyTorch naive baseline on A100
  • Strong Scaling: Near-linear scaling up to 128 threads on CPU
  • Multi-threaded: 8.11× speedup vs single-thread on 128 cores

Technical Stack

  • Languages: CUDA, C++, C, Python
  • Libraries: OpenMP, PyTorch (for validation)
  • Hardware: NVIDIA A100 GPUs, x86_64 multi-core CPUs
  • Testing: Validated against GPT-2 attention layer outputs

This work bridges the gap between deep learning algorithms and systems-level optimization, demonstrating expertise in both ML theory and high-performance computing.

🔗 GitHub Repository: Self_Attention_Kernels
📊 Performance Charts: See repository for detailed benchmarks on Cineca A100 and Orfeo clusters

CUDA kernel performance comparison on Orfeo cluster showing progressive optimization from v3 to v6

Strong scaling results demonstrating near-linear speedup across 1-128 threads