Steering Vectors as a Lightweight Alternative to Activation Patching
Karima Belkadi, Jiang Yifei, Nico Almgren, Hester Vandekerckhove
@inproceedings{belkadi2025steering,
title = {Steering Vectors as a Lightweight Alternative to Activation Patching},
author = {Belkadi, Karima and Yifei, Jiang and Almgren, Nico and Vandekerckhove, Hester},
year = {2025},
booktitle = {NeurIPS 2025 · alphabell index 25/17},
month = {nov},
doi = {10.48550/arXiv.2511.04188},
url = {https://dev.alphabell.com/publications/steering-vectors-lightweight-alternative}
}
Abstract
Activation patching has become a foundational tool for mechanistic interpretability, but its compute and memory costs scale poorly with model size and with the number of intervention points. We show that a carefully constructed family of steering vectors, learned directly from contrastive activation pairs and applied at a single residual-stream layer, recovers 88% of patching's causal-attribution faithfulness on a 12-task benchmark at roughly 4% of the compute. We provide both the theoretical justification for the substitution and an open-source implementation (ab-steering) that integrates with the lab's ab-circuits library.
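To make the idea of a steering vector built from contrastive activation pairs concrete, here is a minimal sketch in plain Python. It uses the difference of class means, a common baseline construction, and is not claimed to be the learned family described in the paper or the ab-steering API. The function names (`mean_difference_vector`, `apply_steering`), the scaling parameter `alpha`, and the toy activations are all assumptions introduced for illustration.

```python
from statistics import fmean
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(
    positive: Sequence[Vector], negative: Sequence[Vector]
) -> Vector:
    """Return mean(positive) - mean(negative), computed per dimension.

    Hypothetical helper: a difference-of-means baseline, not the paper's
    learned steering vectors.
    """
    if not positive or not negative:
        raise ValueError("both activation sets must be non-empty")
    dim = len(positive[0])
    if any(len(v) != dim for v in list(positive) + list(negative)):
        raise ValueError("all activations must share one dimensionality")
    pos_mean = [fmean(v[i] for v in positive) for i in range(dim)]
    neg_mean = [fmean(v[i] for v in negative) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(residual: Vector, steering: Vector, alpha: float = 1.0) -> Vector:
    """Add alpha * steering to a single residual-stream activation.

    Hypothetical helper: stands in for an intervention at one layer.
    """
    if len(residual) != len(steering):
        raise ValueError("residual and steering vector must match in length")
    return [r + alpha * s for r, s in zip(residual, steering)]


# Toy 2-dimensional activations for a contrastive pair of prompt sets.
positive_acts = [[1.0, 2.0], [3.0, 4.0]]
negative_acts = [[0.0, 1.0], [2.0, 1.0]]

steering = mean_difference_vector(positive_acts, negative_acts)
steered = apply_steering([0.5, 0.5], steering, alpha=2.0)
```

In a real model the activations would be captured at one residual-stream layer and the vector added during the forward pass. The sketch only shows the arithmetic: a single vector addition per intervention, rather than a separate patched forward pass per intervention point.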
Index metadata
- Cell: hilbert-13
- Compute: 58 H100-days
- Status: Open release
- Code: github.com/alphabell-labs/ab-steering
- Companion: ab-circuits 1.2 release notes
- DOI: 10.48550/arXiv.2511.04188
- arXiv: arXiv:2511.04188
What this paper is part of
This index entry is part of the Interpretability & alignment research axis. The producing cell, hilbert-13, collaborates with adjacent cells listed in the cell directory. Where a paired interpretability cell exists, it is identified in the metadata above, and any disagreement reports from that cell accompany the public release.
How to read this
If you want to use the result, the code is at https://github.com/alphabell-labs/ab-steering; a dataset, if one is released, will be at https://huggingface.co/datasets/alphabell/steering-2025. To cite this report, prefer the DOI/arXiv identifier and the BibTeX block above. To discuss it with the producing cell, contact the lab and quote the index entry slug steering-vectors-lightweight-alternative.
Limitations
Each cell-published report carries an explicit limitations section in the internal index. We do not paraphrase it here. Read the linked PDF, particularly its limitations and threats-to-validity sections, before downstream use.
Karima Belkadi, Jiang Yifei, Nico Almgren, Hester Vandekerckhove. Steering Vectors as a Lightweight Alternative to Activation Patching. NeurIPS 2025 · alphabell index 25/17, Nov 2025. arXiv:2511.04188. doi:10.48550/arXiv.2511.04188.