Self-Attention Kernels: High-Performance Computing

This project implements Causal Multi-Head Self-Attention (CMHSA) with three optimized backends, designed for high-performance deep learning inference:

Single-threaded CPU - Baseline implementation with SIMD optimizations
Multi-threaded CPU - OpenMP parallelization with tiled memory access patterns
CUDA GPU - Hand-optimized kernels achieving a 1.09× speedup over a naive PyTorch baseline on A100 GPUs
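For readers unfamiliar with the operation all three backends compute, here is a minimal reference sketch of causal multi-head self-attention in pure Python. It is not the project's code: the function names are hypothetical, and it assumes that Q, K and V are already projected and that heads are contiguous slices of the feature dimension. It only illustrates the math (scaled dot products, a causal mask, a numerically stable softmax) that an optimized kernel has to reproduce.

```python
import math


def causal_attention_head(q, k, v):
    """Causal scaled dot-product attention for a single head.

    q, k, v are lists of T vectors (lists of floats) of equal length d.
    Returns T output vectors of length d.
    """
    T, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(T):
        # Causal mask: query i only attends to keys 0..i.
        scores = [scale * sum(q[i][c] * k[j][c] for c in range(d))
                  for j in range(i + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] * v[j][c] for j in range(i + 1)) / total
                    for c in range(d)])
    return out


def causal_mhsa(q, k, v, n_heads):
    """Run each head on its slice of the feature dimension, then concatenate."""
    T, d_model = len(q), len(q[0])
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    out = [[] for _ in range(T)]
    for h in range(n_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        head = causal_attention_head([row[lo:hi] for row in q],
                                     [row[lo:hi] for row in k],
                                     [row[lo:hi] for row in v])
        for t in range(T):
            out[t].extend(head[t])
    return out
```

Because of the causal mask, the output for token t depends only on tokens 0..t, which is the property the optimized kernels can exploit to skip roughly half of the score matrix.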

Performance Highlights

Benchmarks on Cineca (A100 GPUs) and Orfeo HPC clusters demonstrate significant speedups through iterative optimization:

CUDA v4.6: 1.09× faster than the naive PyTorch baseline on A100
Strong Scaling: Near-linear scaling up to 128 threads on CPU
Multi-threaded: 8.11× speedup over the single-threaded version on 128 cores
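As a reading aid for the scaling figures, the two standard strong-scaling metrics can be computed as in the sketch below: speedup is serial time divided by parallel time, and parallel efficiency is speedup divided by thread count. The helper names and the timings are hypothetical and are not measurements from this project.

```python
def speedup(t_serial, t_parallel):
    """Strong-scaling speedup: same problem size, more threads."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_threads):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup(t_serial, t_parallel) / n_threads


# Hypothetical wall-clock timings in seconds, keyed by thread count.
timings = {1: 8.0, 2: 4.0, 4: 2.5, 8: 1.6}
for threads, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = parallel_efficiency(timings[1], seconds, threads)
    print(f"{threads:>3} threads: speedup {s:.2f}x, efficiency {e:.2f}")
```

An efficiency close to 1.0 corresponds to near-linear scaling; values well below 1.0 usually point to memory bandwidth limits or synchronization overhead.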

Technical Stack

Languages: CUDA, C++, C, Python
Libraries: OpenMP, PyTorch (for validation)
Hardware: NVIDIA A100 GPUs, x86_64 multi-core CPUs
Testing: Validated against GPT-2 attention layer outputs
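Validating a custom kernel against a reference output generally comes down to an element-wise comparison with a tolerance. The sketch below shows one way to do that in plain Python, using the same criterion as torch.allclose (|candidate - reference| <= atol + rtol * |reference|). The function names and the tolerance values are assumptions for illustration, not the project's actual test harness.

```python
def max_abs_diff(candidate, reference):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(abs(c - r)
               for c_row, r_row in zip(candidate, reference)
               for c, r in zip(c_row, r_row))


def outputs_match(candidate, reference, atol=1e-5, rtol=1e-4):
    """True if shapes agree and every element is within tolerance.

    The default tolerances are illustrative; float32 kernels that reorder
    reductions typically need looser bounds than float64 ones.
    """
    if len(candidate) != len(reference):
        return False
    for c_row, r_row in zip(candidate, reference):
        if len(c_row) != len(r_row):
            return False
        for c, r in zip(c_row, r_row):
            if abs(c - r) > atol + rtol * abs(r):
                return False
    return True
```

Reporting max_abs_diff alongside the pass/fail result makes it easier to tell a small floating-point drift from a real indexing or masking bug.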

This work bridges the gap between deep learning algorithms and systems-level optimization, demonstrating expertise in both ML theory and high-performance computing.

🔗 GitHub Repository: Self_Attention_Kernels
📊 Performance Charts: See repository for detailed benchmarks on Cineca A100 and Orfeo clusters

CUDA kernel performance comparison on Orfeo cluster showing progressive optimization from v3 to v6

Strong scaling results demonstrating near-linear speedup across 1-128 threads


Jacopo Zacchigna
