MoE Interpretability
Adapting HeadPursuit / SOMP to classify expert specialization in Mixture-of-Experts LLMs
This project adapts the HeadPursuit framework — originally designed for attention heads — to Mixture-of-Experts language models.
Target models: allenai/OLMoE-1B-7B-0924-Instruct (16 layers, 64 experts/layer, top-8 routing) and openai/gpt-oss-20b.
Pipeline:
- Capture: use nnsight to extract per-expert activations, with multi-GPU tensor parallelism via torchrun. Activations and metadata are stored in HDF5.
- Pursuit: run Simultaneous Orthogonal Matching Pursuit (SOMP) over expert activations against the model's unembedding dictionary, ranking top-k concept atoms per expert with Explained Variance Ratios (EVR).
- Validation: cross-check with mean-projection baselines, per-token frequency analysis, and PCA singular-value spectra to separate monosemantic from polysemantic experts.
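The Pursuit step can be illustrated with a minimal NumPy sketch of SOMP on synthetic data. This is not the repository's implementation: the function name `somp`, the L1 aggregation of correlations across samples, and the uncentered definition of EVR (one minus residual energy over total energy) are all assumptions made for illustration, and a random matrix stands in for the model's unembedding dictionary.

```python
import numpy as np

def somp(Y, D, k):
    """Greedy SOMP sketch (hypothetical helper, not the project's code).

    Y: (n_samples, d) activation vectors for one expert.
    D: (n_atoms, d) dictionary, one atom per row.
    k: number of atoms to select.
    Returns the selected atom indices and the cumulative EVR after each step.
    """
    D = D / np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm atoms
    residual = Y.copy()
    total_energy = np.sum(Y ** 2)
    support, evr = [], []
    for _ in range(k):
        # Score each atom by its summed absolute correlation over all samples.
        scores = np.abs(residual @ D.T).sum(axis=0)
        if support:
            scores[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Jointly refit all samples on the selected atoms (least squares).
        A = D[support]                                    # (s, d)
        coef, *_ = np.linalg.lstsq(A.T, Y.T, rcond=None)  # (s, n_samples)
        residual = Y - coef.T @ A
        # Assumed EVR definition: fraction of signal energy explained so far.
        evr.append(1.0 - np.sum(residual ** 2) / total_energy)
    return support, evr

# Synthetic check: signals built from atoms 3 and 17 plus small noise.
rng = np.random.default_rng(0)
D = rng.normal(size=(128, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)
coefs = rng.normal(size=(200, 2))
Y = coefs @ D[[3, 17]] + 0.01 * rng.normal(size=(200, 64))

support, evr = somp(Y, D, k=4)
print("selected atoms:", support)
print("cumulative EVR:", np.round(evr, 4))
```

On real data, each row of `D` would be one token's unembedding vector, so the selected indices map directly to vocabulary tokens.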
The output is an EVR heatmap across layers × experts, plus per-expert top-token lists for qualitative interpretation.
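As a rough sketch of how such a heatmap could be assembled, per-expert EVR values can be collected into a layers × experts array. The helper names `evr_grid` and `top_experts`, the `(layer, expert, evr)` record format, and the numbers below are placeholders for illustration, not results or code from the project. The grid shape uses the OLMoE dimensions quoted above.

```python
import numpy as np

N_LAYERS, N_EXPERTS = 16, 64  # OLMoE-1B-7B: 16 layers, 64 experts per layer

def evr_grid(records, n_layers=N_LAYERS, n_experts=N_EXPERTS):
    """Place (layer, expert, evr) records into a 2-D grid; missing cells stay NaN."""
    grid = np.full((n_layers, n_experts), np.nan)
    for layer, expert, evr in records:
        grid[layer, expert] = evr
    return grid

def top_experts(grid, n=3):
    """Return the (layer, expert) pairs with the highest EVR, best first."""
    flat = np.where(np.isnan(grid), -np.inf, grid).ravel()
    order = np.argsort(flat)[::-1][:n]
    return [tuple(int(v) for v in np.unravel_index(i, grid.shape)) for i in order]

# Placeholder records with made-up EVR values, only to exercise the helpers.
records = [(0, 5, 0.42), (3, 17, 0.81), (15, 63, 0.10)]
grid = evr_grid(records)
best = top_experts(grid, n=2)
print(grid.shape, best)
```

The resulting array can be passed to a plotting call such as matplotlib's `imshow` to render the heatmap, with NaN cells marking experts that were not analysed.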
Repository currently private — available on request.