Interpretability Cell Pairing: how every dual-use capability run gets a watchful sibling
Karima Belkadi, Hester Vandekerckhove, Yuki Cho
@techreport{belkadi2024interpretability,
title = {Interpretability Cell Pairing: how every dual-use capability run gets a watchful sibling},
author = {Belkadi, Karima and Vandekerckhove, Hester and Cho, Yuki},
year = {2024},
number = {Internal release — alphabell index 24/15},
institution = {alphabell},
month = {sep},
doi = {10.48550/arXiv.2409.58618},
url = {https://dev.alphabell.com/publications/interpretability-cells-protocol}
}
Abstract
alphabell's structural commitment is that any cell working on dual-use capabilities is paired with an interpretability cell with rolling read-access to checkpoints, training logs, and proposal commits. We describe the practical implementation: the trust model, the artefact pipeline, the disagreement procedure, and an example walkthrough drawn from a 2024 sandboxed self-modification run that the paired interpretability cell escalated to halt.
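As a rough illustration of the commitment stated in the abstract, the pairing rule can be written as an invariant check: every dual-use cell must have a paired interpretability cell whose read-access covers checkpoints, training logs, and proposal commits. This is a minimal sketch, not the report's implementation. The `Cell` and `Pairing` types, the `pairing_violations` function, and the cell names are all hypothetical; only the three artefact kinds come from the abstract.

```python
from dataclasses import dataclass, field

# The three artefact kinds named in the abstract.
REQUIRED_ARTEFACTS = frozenset({"checkpoints", "training_logs", "proposal_commits"})


@dataclass(frozen=True)
class Cell:
    """Hypothetical record for a research cell."""
    name: str
    dual_use: bool = False


@dataclass
class Pairing:
    """Hypothetical record linking a capability cell to its interpretability cell."""
    capability_cell: str
    interpretability_cell: str
    read_access: frozenset = field(default_factory=frozenset)


def pairing_violations(cells, pairings):
    """Return one message per dual-use cell that breaks the pairing rule."""
    by_cell = {p.capability_cell: p for p in pairings}
    problems = []
    for cell in cells:
        if not cell.dual_use:
            continue  # the rule only covers dual-use work
        pairing = by_cell.get(cell.name)
        if pairing is None:
            problems.append(f"{cell.name}: no paired interpretability cell")
            continue
        missing = REQUIRED_ARTEFACTS - pairing.read_access
        if missing:
            problems.append(
                f"{cell.name}: missing read-access to {', '.join(sorted(missing))}"
            )
    return problems


# Illustrative data only; these cell names are made up.
cells = [Cell("cell-a", dual_use=True), Cell("cell-b", dual_use=True), Cell("cell-c")]
pairings = [
    Pairing("cell-a", "interp-a", REQUIRED_ARTEFACTS),
    Pairing("cell-b", "interp-b", frozenset({"checkpoints"})),
]
violations = pairing_violations(cells, pairings)
```

In this example `cell-a` satisfies the rule, `cell-c` is exempt because it is not dual-use, and `cell-b` is reported because its paired cell can read checkpoints only.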
Index metadata
- Cell: hilbert-13
- Compute: 11 H100-days (analysis only)
- Status: Open release
- DOI: 10.48550/arXiv.2409.58618
- arXiv: arXiv:2409.58618
What this paper is part of
This index entry belongs to the Interpretability & alignment research axis. The producing cell, hilbert-13, collaborates with adjacent cells listed in the cell directory. The paired interpretability cell (where applicable) is identified in the metadata above; its disagreement reports, if any, accompany the public release.
How to read this
To use the result: the code (where available) is at https://github.com/alphabell-labs/ab-interpre; no dataset has been released yet, and its location is TBD. To cite this report, use the DOI or arXiv identifier and the BibTeX block above. To discuss the report with the producing cell, contact the lab and quote the index entry slug interpretability-cells-protocol.
Limitations
Each cell-published report carries an explicit limitations section in the internal index. We do not paraphrase it here. Read the linked PDF — particularly its limitations and threats-to-validity sections — before downstream use.
Karima Belkadi, Hester Vandekerckhove, Yuki Cho. Interpretability Cell Pairing: how every dual-use capability run gets a watchful sibling. Internal release — alphabell index 24/15, Sep 2024. arXiv:2409.58618. doi:10.48550/arXiv.2409.58618.