Self-Attention Kernels
Optimized Causal Multi-Head Self-Attention in CUDA, OpenMP, and SIMD — 1.09× faster than PyTorch naive on A100
Hand-optimized Causal Multi-Head Self-Attention (CMHSA) with three backends validated against GPT-2 attention outputs:
| Backend | Result |
|---|---|
| CUDA v4.6 | 1.09× faster than PyTorch naive on A100 |
| OpenMP (128 cores) | 8.11× speedup vs single-thread |
| Strong scaling | Near-linear to 128 threads |
Benchmarked on Cineca (A100 GPUs) and Orfeo HPC clusters. The CUDA path uses tiled shared-memory access patterns and progressive kernel optimization across six versions.
🔗 GitHub