Interpretability & alignment infrastructure.

The cross-cutting axis that supplies tooling and theory to the other three. It covers mechanistic circuit analysis at frontier scale, scalable oversight for agentic systems, formal verification of learned policies, and the operational trust protocols that make cell pairing work in practice.

Active cells: 4
2025 published reports: 6
Axis steward: Karima Belkadi (term ends 2026-Q2)
Compute commitment: 22% of pool

Position

Interpretability and alignment at alphabell are not separate axes. They are bundled because they share most of their methodological substrate and almost all of their production tooling — and because the labelling distinction has not, in our experience, helped anyone do better work.

More importantly, the interpretability axis at alphabell is infrastructural. It is not a parallel research track that publishes occasionally about why other tracks are concerning. It builds the tooling that the other three axes use, day to day, to do their own work. The mechanistic circuit-analysis library is consumed by every other axis. The debate-plus-trace oversight protocol is consumed by the agentic axis. The formal verification framework is consumed by world-models cells operating in bounded environments. The paired-cell protocol is consumed by the RSI axis.

Research threads

Mechanistic circuit analysis at frontier scale. Tooling that operates on frontier-class models without quadratic cost in attention-head enumeration. ab-circuits 1.2 is the current open release; cell hilbert-13 is the steward.
Scalable oversight for agentic systems. Debate-based oversight extended to multi-step agent execution, leveraging the structured execution traces emitted by the agentic-engineering substrate. The debate-plus-trace paper (25/12) is the reference.
Formal verification of learned policies. Combining abstract interpretation of the policy network with symbolic execution of the environment dynamics. Works in bounded environments today; the cell roadmap targets weakening the boundedness requirement.
Specification refinement from traces. Inferring formal specifications from observed execution traces, so that future runs can be checked against specifications nobody has yet had to write by hand.
Paired-cell protocol. The operational machinery — trust model, artefact pipeline, disagreement procedure, escalation channels — that makes interpretability-cell pairing actually function as a check rather than a ceremony. Reference: 24/15.
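
To make the formal-verification thread more concrete, here is a minimal sketch of interval abstract interpretation applied to a tiny ReLU policy network. It is not ab-verify's API: the names `Interval`, `affine`, `relu`, `policy_bounds` and `within_limits`, the weights, and the action limits are all hypothetical, and the symbolic execution of environment dynamics is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float


def affine(xs: List[Interval], weights: List[List[float]], bias: List[float]) -> List[Interval]:
    """Sound interval bounds for y = W x + b."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, x in zip(row, xs):
            # A non-negative weight keeps the bounds in order; a negative one swaps them.
            lo += w * x.lo if w >= 0 else w * x.hi
            hi += w * x.hi if w >= 0 else w * x.lo
        out.append(Interval(lo, hi))
    return out


def relu(xs: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [Interval(max(0.0, x.lo), max(0.0, x.hi)) for x in xs]


# Hypothetical two-layer policy: 2 state inputs -> 2 hidden units -> 1 action.
W1, B1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]
W2, B2 = [[1.0, -2.0]], [0.1]


def policy_bounds(state_box: List[Interval]) -> Interval:
    """Over-approximate the action for every state inside the box."""
    hidden = relu(affine(state_box, W1, B1))
    return affine(hidden, W2, B2)[0]


def within_limits(action: Interval, lo: float, hi: float) -> bool:
    """True only if the property holds for the whole over-approximation."""
    return lo <= action.lo and action.hi <= hi


box = [Interval(0.0, 1.0), Interval(0.0, 1.0)]
action = policy_bounds(box)
print(action, within_limits(action, -2.0, 2.0))
```

With these weights the computed action interval is approximately [-1.4, 1.1], so the limit [-2, 2] is proved for every state in the box. A failed check would not by itself show a violation, because interval bounds over-approximate.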

Tooling we maintain

ab-circuits — frontier-scale mechanistic interpretability library, MIT licensed.
ab-trace — content-addressed execution-trace store consumed by every substrate-hosted agent.
ab-debate — protocol harness for debate-based oversight, including the debate-plus-trace variant for agentic systems.
ab-verify — formal-verification toolkit for learned policies in bounded environments. Apache 2.0.
ab-pairs — the operational tooling used to maintain the interpretability-cell pairing relationship (read-access agreements, artefact subscriptions, disagreement-handling state machine).
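
For readers unfamiliar with the term, "content-addressed" means a trace is stored under the hash of its own bytes. The sketch below illustrates the idea in memory. It is an assumption-laden illustration, not ab-trace's actual interface: `TraceStore`, `put`, `get`, the choice of SHA-256, and the JSON encoding are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


class TraceStore:
    """In-memory content-addressed store: the key is the SHA-256 of the content."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, trace: Dict[str, Any]) -> str:
        # Canonical JSON (sorted keys, fixed separators) so equal traces hash equally.
        blob = json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        # Storing the same content twice is a no-op, which gives deduplication for free.
        self._blobs.setdefault(digest, blob)
        return digest

    def get(self, digest: str) -> Dict[str, Any]:
        blob = self._blobs[digest]
        # Re-hash on read: a mismatch means the stored bytes were altered.
        if hashlib.sha256(blob).hexdigest() != digest:
            raise ValueError("stored trace does not match its address")
        return json.loads(blob)


store = TraceStore()
first = store.put({"step": 1, "tool": "search"})
second = store.put({"tool": "search", "step": 1})  # same content, different key order
print(first == second)
```

Because the address is derived from the content, anyone holding the address can check that the trace they read back is the one that was written.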

How pairing works in practice

When an axis-2 or axis-3 cell decides to begin work on a dual-use capability (per the charter's dual-use definition), it requests a pairing through the lab governance protocol. The pairing call goes to interpretability cells with bandwidth, ranked by recency of relevant work. A successful pairing produces a signed pairing record specifying: the read-access scope (checkpoint frequency, trace coverage), the disagreement protocol, the halt-call authority, and the duration. Pairings are revisited every six months.
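
The pairing record and the candidate ranking described above can be sketched as plain data structures. This is a hypothetical illustration rather than the ab-pairs schema: every class, field name, unit and cell name below is an assumption, and the signature is represented by a placeholder string.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import List


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the target month's length."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


@dataclass(frozen=True)
class ReadAccessScope:
    checkpoint_every_steps: int  # assumed unit for "checkpoint frequency"
    trace_coverage: float        # assumed: fraction of traces readable, 0.0 to 1.0


@dataclass(frozen=True)
class PairingRecord:
    requesting_cell: str
    paired_cell: str
    scope: ReadAccessScope
    disagreement_protocol: str
    halt_call_authority: str
    start: date
    duration_months: int
    signature: str  # placeholder; the real signing scheme is not described here

    def next_review(self) -> date:
        # The text states that pairings are revisited every six months.
        return add_months(self.start, 6)


@dataclass(frozen=True)
class Candidate:
    cell: str
    has_bandwidth: bool
    last_relevant_work: date


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Keep cells with bandwidth, most recent relevant work first."""
    available = [c for c in candidates if c.has_bandwidth]
    return sorted(available, key=lambda c: c.last_relevant_work, reverse=True)


candidates = [
    Candidate("cell-a", True, date(2025, 3, 1)),
    Candidate("cell-b", False, date(2025, 9, 1)),
    Candidate("cell-c", True, date(2025, 6, 1)),
]
print([c.cell for c in rank_candidates(candidates)])
```

Freezing the dataclasses mirrors the idea of a signed record: once issued, a pairing is replaced at review time rather than edited in place.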

Over the protocol's history, pairings have called five halts. Three halted RSI runs (see RSI axis). The other two halted agentic-axis sandboxed self-modification experiments that produced traces the paired interpretability cells could not satisfactorily reconstruct. In all five cases, the paired cell's halt was honoured without override.

Where this connects

Everywhere. That is the axis's job.

Active cells under this axis

hilbert-13 · Frontier-scale mech-interp · ab-circuits 1.2 · Interp. & align. · 6 contributors · active

lebesgue-22 · Scalable oversight + verif. · Debate-plus-trace v2 · Interp. & align. · 5 contributors · active

noether-12 · Symmetry-aware policy verif. · Cross-cell with lebesgue-22 · Interp. & align. · 3 contributors · active

cantor-18 · Specification refinement · From-traces specifications · Interp. & align. · 3 contributors · active

Publications under this axis

2601.04221 · conf
Cooperative Membership Functions for Multi-Agent Oversight
Hiroshi Tanigawa, Ifeoma Nwosu-Howard, Ruth Wernicke
ICLR 2026 · arXiv 2601.04221 · Jan 2026 · interp.

2601.01890 · conf
Mechanistic Markers of Planning Depth in Language-Model Agents
Karima Belkadi, Hester Vandekerckhove, Yuki Cho, Jiang Yifei
ICLR 2026 · alphabell index 26/02 · Jan 2026 · interp.

2512.01775 · conf
Cross-Cell Replication of the 700-Circuit Conjecture
Nico Almgren, Helena Salgueiro, Karima Belkadi, Gita Sundaram
NeurIPS 2025 · alphabell index 25/24 · Dec 2025 · interp.

2511.04188 · conf
Steering Vectors as a Lightweight Alternative to Activation Patching
Karima Belkadi, Jiang Yifei, Nico Almgren, Hester Vandekerckhove
NeurIPS 2025 · alphabell index 25/17 · Nov 2025 · interp.

2509.67033 · preprint
Scalable Oversight for Multi-Step Agent Systems: a Debate-Plus-Trace Approach
Ifeoma Nwosu-Howard, Hiroshi Tanigawa, Maral Lotfi, Ruth Wernicke
Internal release · alphabell index 25/12 · arXiv 2509.04221 · Sep 2025 · interp.

2507.02614 · conf
Latent Goal Decoding via Sparse Probes
Jiang Yifei, Karima Belkadi, Wen Shao
ICML 2025 · Alignment Forum cross-post · Jul 2025 · interp.

2505.89369 · preprint
Mechanistic Circuit Analysis at Frontier Scale: cells as a unit of interpretability
Jiang Yifei, Nico Almgren, Karima Belkadi, Hester Vandekerckhove
Internal release · alphabell index 25/03 · arXiv 2505.18831 · May 2025 · interp.

2503.73959 · internal
Toward Formal Verification of Learned Policies in Bounded Environments
Aviva Stern, Sun Kyung-min, Felipe Avelar
Internal release · alphabell index 25/01 · Mar 2025 · interp.

2412.04881 · conf
Latent Trajectory Surgery: Editing Agent Plans in Mid-Run
Helena Salgueiro, Gita Sundaram, Catriona MacLeod
NeurIPS 2024 · alphabell index 24/21 · Dec 2024 · interp.

2411.02009 · conf
Federated Trace Auditing Without Centralizing Logs
Akoss Vidor, Henri Brouillard, Olu Folarin, Pranav Iyer
SOSP 2024 · alphabell index 24/16 · Nov 2024 · interp.

2409.58618 · internal
Interpretability Cell Pairing: how every dual-use capability run gets a watchful sibling
Karima Belkadi, Hester Vandekerckhove, Yuki Cho
Internal release · alphabell index 24/15 · Sep 2024 · interp.

2409.02118 · preprint
Reward Hacking Detection Through Trace-Level Anomaly Models
Maral Lotfi, Hiroshi Tanigawa, Olu Folarin
arXiv 2409.02118 · Alignment Forum (Sep 2024) · Sep 2024 · interp.