
Interpretability & alignment infrastructure.

The cross-cutting axis that supplies tooling and theory to the other three. Mechanistic circuit analysis at frontier scale, scalable oversight for agentic systems, formal verification of learned policies, and the operational trust protocols that make cell pairing actually work.

Active cells: 4
2025 published reports: 6
Axis steward: Karima Belkadi (term ends 2026-Q2)
Compute commitment: 22% of pool

Position

Interpretability and alignment at alphabell are not separate axes. They are bundled because they share most of their methodological substrate and almost all of their production tooling — and because the labelling distinction has not, in our experience, helped anyone do better work.

More importantly, the interpretability axis at alphabell is infrastructural. It is not a parallel research track that publishes occasionally about why other tracks are concerning. It builds the tooling that the other three axes use, day to day, to do their own work. The mechanistic circuit-analysis library is consumed by every other axis. The debate-plus-trace oversight protocol is consumed by the agentic axis. The formal verification framework is consumed by world-models cells operating in bounded environments. The paired-cell protocol is consumed by the RSI axis.

Research threads

  • Mechanistic circuit analysis at frontier scale. Tooling that operates on frontier-class models without quadratic cost in attention-head enumeration. ab-circuits 1.2 is the current open release; cell hilbert-13 is the steward.
  • Scalable oversight for agentic systems. Debate-based oversight extended to multi-step agent execution, leveraging the structured execution traces emitted by the agentic-engineering substrate. The debate-plus-trace paper (25/12) is the reference.
  • Formal verification of learned policies. Combining abstract interpretation of the policy network with symbolic execution of the environment dynamics. Works in bounded environments today; the cell roadmap targets weakening the boundedness requirement.
  • Specification refinement from traces. Inferring formal specifications from observed execution traces, so that future runs can be checked against specifications nobody has yet had to write by hand.
  • Paired-cell protocol. The operational machinery — trust model, artefact pipeline, disagreement procedure, escalation channels — that makes interpretability-cell pairing actually function as a check rather than a ceremony. Reference: 24/15.
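To make the abstract-interpretation half of the formal-verification thread concrete, here is a minimal sketch of interval bound propagation through a tiny ReLU policy network. It is not the ab-verify API: the 2-2-1 network, its weights, and the names `affine_bounds`, `relu_bounds`, `policy_output_bounds` and `action_stays_within` are all invented for illustration. The symbolic-execution half over environment dynamics is not shown, and the sketch relies on the input box being bounded.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower, upper) bound on one scalar

# Hypothetical weights for a 2-input, 2-hidden, 1-output policy network.
W1 = [[1.0, -1.0], [0.5, 0.5]]
B1 = [0.0, -0.25]
W2 = [[1.0, -2.0]]
B2 = [0.1]


def affine_bounds(box: List[Interval],
                  weights: List[List[float]],
                  bias: List[float]) -> List[Interval]:
    """Sound interval bounds for y = W x + b when each x_i lies in box[i]."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, (x_lo, x_hi) in zip(row, box):
            # A non-negative weight keeps the bound order; a negative one swaps it.
            if w >= 0:
                lo += w * x_lo
                hi += w * x_hi
            else:
                lo += w * x_hi
                hi += w * x_lo
        out.append((lo, hi))
    return out


def relu_bounds(box: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each endpoint separately."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in box]


def policy_output_bounds(box: List[Interval]) -> List[Interval]:
    """Propagate an input box through the illustrative network."""
    hidden = relu_bounds(affine_bounds(box, W1, B1))
    return affine_bounds(hidden, W2, B2)


def action_stays_within(box: List[Interval], limit: float) -> bool:
    """True if the bounds prove |action| <= limit for every input in the box."""
    (lo, hi), = policy_output_bounds(box)
    return -limit <= lo and hi <= limit
```

For the input box `[(0.0, 1.0), (0.0, 1.0)]` the propagated output interval is approximately `(-1.4, 1.1)`, so the property `|action| <= 1.5` is proved while `|action| <= 1.0` is not. A failed check does not mean the property is violated: interval bounds over-approximate, which is why tighter abstract domains are usually needed in practice.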

Tooling we maintain

  • ab-circuits — frontier-scale mechanistic interpretability library, MIT licensed.
  • ab-trace — content-addressed execution-trace store consumed by every substrate-hosted agent.
  • ab-debate — protocol harness for debate-based oversight, including the debate-plus-trace variant for agentic systems.
  • ab-verify — formal-verification toolkit for learned policies in bounded environments. Apache 2.0.
  • ab-pairs — the operational tooling used to maintain the interpretability-cell pairing relationship (read-access agreements, artefact subscriptions, disagreement-handling state machine).
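Content addressing, the property attributed to ab-trace above, means that a trace is stored under a key derived from its own bytes. The sketch below shows the idea with an in-memory store. The class name `TraceStore`, the canonical-JSON serialisation, and the choice of SHA-256 are assumptions made for illustration; they do not describe ab-trace's actual format or API.

```python
import hashlib
import json
from typing import Any, Dict, List

Trace = List[Dict[str, Any]]  # hypothetical shape: a list of step records


class TraceStore:
    """Illustrative in-memory content-addressed store for execution traces."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    @staticmethod
    def _canonical(trace: Trace) -> bytes:
        # Sorted keys and fixed separators give one byte string per logical trace.
        return json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def put(self, trace: Trace) -> str:
        """Store a trace and return its address (hex SHA-256 of its bytes)."""
        blob = self._canonical(trace)
        address = hashlib.sha256(blob).hexdigest()
        # Identical traces map to the same address, so storing is idempotent.
        self._blobs.setdefault(address, blob)
        return address

    def get(self, address: str) -> Trace:
        """Fetch a trace, re-hashing it to detect corruption or tampering."""
        blob = self._blobs[address]
        if hashlib.sha256(blob).hexdigest() != address:
            raise ValueError("stored trace does not match its address")
        return json.loads(blob)

    def __len__(self) -> int:
        return len(self._blobs)
```

The design choice worth noting is that the address doubles as an integrity check: a reader who holds only the address can verify that the trace it receives is the one that was originally recorded.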

How pairing works in practice

When an axis-2 or axis-3 cell decides to begin work on a dual-use capability (per the charter's dual-use definition), it requests a pairing through the lab governance protocol. The pairing call goes to interpretability cells with bandwidth, ranked by recency of relevant work. A successful pairing produces a signed pairing record specifying: the read-access scope (checkpoint frequency, trace coverage), the disagreement protocol, the halt-call authority, and the duration. Pairings are revisited every six months.
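The pairing step described above can be sketched as a ranking followed by a record. Everything named here is hypothetical: the `InterpCell` and `PairingRecord` types, their field names, and the `rank_candidates` helper are illustrative and are not the ab-pairs schema. Signing is omitted, and the 182-day review interval is only an approximation of "every six months".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class InterpCell:
    """Hypothetical view of an interpretability cell as a pairing candidate."""
    name: str
    has_bandwidth: bool
    last_relevant_work: date


@dataclass(frozen=True)
class PairingRecord:
    """Hypothetical record holding the four items a pairing specifies."""
    requesting_cell: str
    paired_cell: str
    checkpoint_frequency: str   # read-access scope, part 1
    trace_coverage: str         # read-access scope, part 2
    disagreement_protocol: str
    halt_call_authority: str
    start: date                 # duration is modelled as a start and end date
    end: date

    def next_review(self) -> date:
        # Assumption: "every six months" approximated as 182 days from start.
        return self.start + timedelta(days=182)


def rank_candidates(cells: List[InterpCell]) -> List[InterpCell]:
    """Keep cells with bandwidth, most recent relevant work first."""
    available = [c for c in cells if c.has_bandwidth]
    return sorted(available, key=lambda c: c.last_relevant_work, reverse=True)
```

Calling `rank_candidates` on a list of candidate cells yields the order in which the pairing call would be offered under these assumptions; the first cell to accept would then be written into a `PairingRecord`.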

Over the protocol's history, paired cells have called five halts. Three of them stopped RSI runs (see RSI axis). The other two stopped agentic-axis sandboxed self-modification experiments whose traces the paired interpretability cells could not satisfactorily reconstruct. In all five cases, the paired cell's halt was honoured without override.

Where this connects

Everywhere. That is the axis's job.

Active cells under this axis

hilbert-13 | Frontier-scale mech-interp | ab-circuits 1.2 | Interp. & align. | 6 contributors | active
lebesgue-22 | Scalable oversight + verif. | Debate-plus-trace v2 | Interp. & align. | 5 contributors | active
noether-12 | Symmetry-aware policy verif. | Cross-cell with lebesgue-22 | Interp. & align. | 3 contributors | active
cantor-18 | Specification refinement | From-traces specifications | Interp. & align. | 3 contributors | active

