Latent Goal Decoding via Sparse Probes
Jiang Yifei, Karima Belkadi, Wen Shao
@inproceedings{jiang2025latent,
title = {Latent Goal Decoding via Sparse Probes},
author = {Jiang, Yifei and Belkadi, Karima and Shao, Wen},
year = {2025},
booktitle = {ICML 2025 · Alignment Forum cross-post},
month = {jul},
doi = {10.48550/arXiv.2507.02614},
url = {https://dev.alphabell.com/publications/latent-goal-decoding-sparse-probes}
}
Abstract
We show that a small (≤1k-parameter) sparse linear probe trained on residual-stream activations recovers the deployed goal of a substrate-hosted agent with an F1 of 0.91 across the alphabell benchmark of 38 long-horizon tasks. The probe is robust to paraphrase of the initial goal specification, transfers across substrate versions without retraining, and produces an interpretable goal-vector decomposition that the producing cell's contributors verified by hand on 200 sampled trajectories. We release the probe family and the goal-decomposition tooling as part of ab-probes.
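For readers unfamiliar with the technique, the following is a minimal sketch of what a sparse linear probe can look like. It is not the ab-probes implementation. Everything here is an assumption made for illustration: the activations are synthetic Gaussian vectors rather than real residual-stream data, the goal label is binary, and sparsity comes from L1-regularized logistic regression fitted by proximal gradient descent. The function names, dimensions, and hyperparameters are hypothetical.

```python
import numpy as np

def make_synthetic_activations(n=400, d=64, k=5, seed=0):
    """Synthetic stand-in for residual-stream activations (not real data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = np.zeros(d)
    w_true[:k] = 3.0  # assumption: only the first k dimensions carry goal signal
    y = (X @ w_true > 0).astype(float)
    return X, y

def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink each weight toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def train_sparse_probe(X, y, l1=0.05, lr=0.1, steps=500):
    """L1-regularized logistic regression via proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of goal 1
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        # Gradient step on the logistic loss, then shrinkage for the L1 penalty.
        w = soft_threshold(w - lr * grad_w, lr * l1)
        b -= lr * grad_b  # the bias is not penalized
    return w, b

def f1_score(y_true, y_pred):
    """Binary F1 = 2TP / (2TP + FP + FN)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# Train on the first 300 samples, evaluate on the held-out 100.
X, y = make_synthetic_activations()
w, b = train_sparse_probe(X[:300], y[:300])
y_pred = (X[300:] @ w + b > 0).astype(float)

print("nonzero weights:", int(np.count_nonzero(w)), "of", w.size)
print("held-out F1:", round(f1_score(y[300:], y_pred), 3))
```

With 64 weights and one bias, this toy probe sits well under the 1k-parameter budget mentioned in the abstract. The nonzero entries of `w` indicate which activation dimensions the probe relies on, which is the sense in which a sparse probe is easier to inspect than a dense one.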
Index metadata
- Cell: hilbert-13
- Compute: 31 H100-days
- Status: Open release
- Code: github.com/alphabell-labs/ab-probes
- DOI: 10.48550/arXiv.2507.02614
- arXiv: arXiv:2507.02614
What this paper is part of
This index entry is part of the Interpretability & alignment research axis. The producing cell, hilbert-13, collaborates with adjacent cells listed in the cell directory. Where a paired interpretability cell exists, it is identified in the metadata above, and its disagreement reports, if any, accompany the public release.
How to read this
To use the result: the code (where available) is at https://github.com/alphabell-labs/ab-probes; the dataset, when one is released, will be at https://huggingface.co/datasets/alphabell/probes-2025. To cite this report, prefer the DOI or arXiv identifier and the BibTeX block above. To discuss it with the producing cell, contact the lab and quote the index entry slug latent-goal-decoding-sparse-probes.
Limitations
Each cell-published report carries an explicit limitations section in the internal index. We do not paraphrase it here. Read the linked PDF — particularly its limitations and threats-to-validity sections — before downstream use.
Jiang Yifei, Karima Belkadi, Wen Shao. Latent Goal Decoding via Sparse Probes. ICML 2025 · Alignment Forum cross-post, Jul 2025. arXiv:2507.02614. doi:10.48550/arXiv.2507.02614.