MoE Interpretability | Jacopo Zacchigna

This project adapts the HeadPursuit framework (originally designed for attention heads) to Mixture-of-Experts language models.

Target models: allenai/OLMoE-1B-7B-0924-Instruct (16 layers, 64 experts/layer, top-8 routing) and openai/gpt-oss-20b.

Pipeline:

Capture: use nnsight to extract per-expert activations, running under torchrun with multi-GPU tensor parallelism. Activations and metadata are stored in HDF5.
Pursuit: run Simultaneous Orthogonal Matching Pursuit (SOMP) over expert activations against the model’s unembedding dictionary, ranking each expert's top-k concept atoms by Explained Variance Ratio (EVR).
Validation: cross-check with mean-projection baselines, per-token frequency analysis, and PCA singular-value spectra to separate monosemantic from polysemantic experts.
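
As a rough illustration of the Pursuit step, here is a minimal NumPy sketch of SOMP for a single expert. It is not the project's code. The function name `somp`, the array shapes, the L1 aggregation of correlations across samples, and the definition of EVR as one minus the ratio of residual to total squared norm (without centering) are all assumptions made for the example.

```python
import numpy as np

def somp(X, D, k):
    """Greedy SOMP sketch (hypothetical helper, not the project's implementation).

    X: (n_samples, d) activations of one expert.
    D: (n_atoms, d) dictionary, e.g. rows of the unembedding matrix.
    k: number of atoms to select.
    Returns the selected atom indices and the cumulative EVR after each step.
    """
    # Normalize atoms so correlations are comparable across atoms.
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    residual = X.copy()
    total = np.sum(X ** 2)  # assumption: uncentered total energy
    selected, evr = [], []
    for _ in range(k):
        # Score each atom by its summed absolute correlation with all residuals.
        scores = np.abs(residual @ D.T).sum(axis=0)
        if selected:
            scores[selected] = -np.inf  # never pick the same atom twice
        selected.append(int(np.argmax(scores)))
        # Jointly re-fit all samples on the selected atoms (the "orthogonal" step).
        basis = D[selected]  # (n_selected, d)
        coeffs, *_ = np.linalg.lstsq(basis.T, X.T, rcond=None)
        residual = X - (basis.T @ coeffs).T
        evr.append(1.0 - np.sum(residual ** 2) / total)
    return selected, evr
```

Because every sample shares the same selected atoms, the resulting atom list describes the expert as a whole rather than any single token. The EVR curve then indicates how much of the expert's activation energy those few atoms account for.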

The output is an EVR heatmap across layers × experts, plus per-expert top-token lists for qualitative interpretation.
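
One possible way to assemble these two outputs is sketched below. The `results` mapping, the `vocab` list, and the choice to fill each heatmap cell with the final cumulative EVR are hypothetical conventions chosen for illustration, not taken from the project.

```python
import numpy as np

def build_outputs(results, vocab, n_layers, n_experts):
    """Assemble a layers x experts EVR grid and per-expert token lists.

    results: hypothetical dict mapping (layer, expert) to
             (atom_ids, evr_curve) as produced by a pursuit step.
    vocab:   list mapping dictionary atom index to a token string.
    """
    # Experts without results stay NaN so they are visibly missing in a plot.
    heatmap = np.full((n_layers, n_experts), np.nan)
    top_tokens = {}
    for (layer, expert), (atom_ids, evr_curve) in results.items():
        # Assumption: the cell holds the EVR reached after the last selected atom.
        heatmap[layer, expert] = evr_curve[-1]
        top_tokens[(layer, expert)] = [vocab[i] for i in atom_ids]
    return heatmap, top_tokens
```

For OLMoE the grid would have shape (16, 64), given the layer and expert counts stated above.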

Repository currently private. Available on request.