Interpretability & alignment infrastructure.
The cross-cutting axis that supplies tooling and theory to the other three. Mechanistic circuit analysis at frontier scale, scalable oversight for agentic systems, formal verification of learned policies, and the operational trust protocols that make cell pairing actually work.
Position
Interpretability and alignment at alphabell are not separate axes. They are bundled because they share most of their methodological substrate and almost all of their production tooling — and because the labelling distinction has not, in our experience, helped anyone do better work.
More importantly, the interpretability axis at alphabell is infrastructural. It is not a parallel research track that publishes occasionally about why other tracks are concerning. It builds the tooling that the other three axes use, day to day, to do their own work. The mechanistic circuit-analysis library is consumed by every other axis. The debate-plus-trace oversight protocol is consumed by the agentic axis. The formal verification framework is consumed by world-models cells operating in bounded environments. The paired-cell protocol is consumed by the RSI axis.
Research threads
- Mechanistic circuit analysis at frontier scale. Tooling that operates on frontier-class models without quadratic cost in attention-head enumeration. ab-circuits 1.2 is the current open release; cell hilbert-13 is the steward.
- Scalable oversight for agentic systems. Debate-based oversight extended to multi-step agent execution, leveraging the structured execution traces emitted by the agentic-engineering substrate. The debate-plus-trace paper (25/12) is the reference.
- Formal verification of learned policies. Combining abstract interpretation of the policy network with symbolic execution of the environment dynamics. Works in bounded environments today; the cell roadmap targets weakening the boundedness requirement.
- Specification refinement from traces. Inferring formal specifications from observed execution traces, so that future runs can be checked against specifications nobody has yet had to write by hand.
- Paired-cell protocol. The operational machinery — trust model, artefact pipeline, disagreement procedure, escalation channels — that makes interpretability-cell pairing actually function as a check rather than a ceremony. Reference: 24/15.
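The abstract-interpretation half of the formal-verification thread can be illustrated with a minimal sketch. It assumes the simplest abstract domain (intervals) and a tiny fully connected ReLU policy, and it leaves out the symbolic execution of the environment entirely. It is not the ab-verify API; every function name and every weight below is hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]             # (lower bound, upper bound)
Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def affine_bounds(weights, bias, box: Sequence[Interval]) -> List[Interval]:
    """Propagate an input box through y = W x + b using interval arithmetic."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, (l, u) in zip(row, box):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:  # a negative weight swaps which end of the interval matters
                lo += w * u
                hi += w * l
        out.append((lo, hi))
    return out


def relu_bounds(box: Sequence[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [(max(0.0, l), max(0.0, u)) for l, u in box]


def policy_output_bounds(layers: Sequence[Layer], box: Sequence[Interval]) -> List[Interval]:
    """Sound over-approximation of the policy's outputs on the input box."""
    for i, (weights, bias) in enumerate(layers):
        box = affine_bounds(weights, bias, box)
        if i < len(layers) - 1:  # no activation after the output layer
            box = relu_bounds(box)
    return list(box)


def action_within_limit(layers, box, limit: float) -> bool:
    """True means proven safe on the box; False means 'not proven', not 'unsafe'."""
    return all(-limit <= lo and hi <= limit for lo, hi in policy_output_bounds(layers, box))


def forward(layers: Sequence[Layer], x: Sequence[float]) -> List[float]:
    """Concrete evaluation of the same network, for sanity-checking the bounds."""
    values = list(x)
    for i, (weights, bias) in enumerate(layers):
        values = [b + sum(w * v for w, v in zip(row, values)) for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            values = [max(0.0, v) for v in values]
    return values


# Hypothetical 2-2-1 policy and a bounded observation space [0, 1] x [0, 1].
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]), ([[1.0, -2.0]], [0.5])]
box = [(0.0, 1.0), (0.0, 1.0)]
print(policy_output_bounds(layers, box))
print(action_within_limit(layers, box, 2.0))
```

The analysis is sound but incomplete: the computed interval always contains every reachable output, yet it may be wider than the true range, so a failed check only means a tighter domain or a smaller box is needed. The need for a finite input box is one way to read the boundedness requirement mentioned above.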
Tooling we maintain
- ab-circuits — frontier-scale mechanistic interpretability library, MIT licensed.
- ab-trace — content-addressed execution-trace store consumed by every substrate-hosted agent.
- ab-debate — protocol harness for debate-based oversight, including the debate-plus-trace variant for agentic systems.
- ab-verify — formal-verification toolkit for learned policies in bounded environments. Apache 2.0.
- ab-pairs — the operational tooling used to maintain the interpretability-cell pairing relationship (read-access agreements, artefact subscriptions, disagreement-handling state machine).
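As an illustration of what "content-addressed" means for a trace store, here is a minimal in-memory sketch. It assumes traces are JSON-serialisable and that the address is the SHA-256 digest of a canonical encoding; neither assumption comes from ab-trace itself, and the `TraceStore` class is hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


class TraceStore:
    """Toy content-addressed store: the key of a trace is the hash of its bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    @staticmethod
    def _encode(trace: Any) -> bytes:
        # Canonical encoding (sorted keys, fixed separators) so that equal
        # traces always produce identical bytes and therefore one address.
        return json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def put(self, trace: Any) -> str:
        blob = self._encode(trace)
        digest = hashlib.sha256(blob).hexdigest()
        self._blobs.setdefault(digest, blob)  # identical traces are stored once
        return digest

    def get(self, digest: str) -> Any:
        blob = self._blobs[digest]
        # Re-hash on read: a mismatch means the stored bytes were altered.
        if hashlib.sha256(blob).hexdigest() != digest:
            raise ValueError("stored trace does not match its address")
        return json.loads(blob)


store = TraceStore()
address = store.put({"step": 1, "tool": "shell", "args": ["ls"]})
print(address, store.get(address))
```

Because the address is derived from the content, anyone holding an address can check that the trace they retrieve is the one originally recorded, which is the property an auditing cell would rely on.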
How pairing works in practice
When an axis-2 or axis-3 cell decides to begin work on a dual-use capability (per the charter's dual-use definition), it requests a pairing through the lab governance protocol. The pairing call goes to interpretability cells with bandwidth, ranked by recency of relevant work. A successful pairing produces a signed pairing record specifying: the read-access scope (checkpoint frequency, trace coverage), the disagreement protocol, the halt-call authority, and the duration. Pairings are revisited every six months.
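The record and the ranking rule described above can be sketched as plain data structures. This is an illustrative model only: the field names, the cell names, and the calendar-month reading of "six months" are assumptions, and signing of the record is omitted.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class CandidateCell:
    name: str
    has_bandwidth: bool
    last_relevant_work: date


@dataclass(frozen=True)
class PairingRecord:
    requesting_cell: str
    interpretability_cell: str
    checkpoint_frequency: str    # read-access scope, part 1
    trace_coverage: str          # read-access scope, part 2
    disagreement_protocol: str
    halt_call_authority: str
    start: date
    duration_months: int

    def next_review(self) -> date:
        """Pairings are revisited every six months (calendar months assumed)."""
        return add_months(self.start, 6)


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the target month's length."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def rank_candidates(cells: List[CandidateCell]) -> List[CandidateCell]:
    """Keep cells with bandwidth, most recent relevant work first."""
    available = [c for c in cells if c.has_bandwidth]
    return sorted(available, key=lambda c: c.last_relevant_work, reverse=True)


# Hypothetical candidates and a hypothetical pairing.
cells = [
    CandidateCell("cell-a", True, date(2025, 1, 10)),
    CandidateCell("cell-b", False, date(2025, 6, 1)),
    CandidateCell("cell-c", True, date(2025, 3, 5)),
]
best = rank_candidates(cells)[0]
record = PairingRecord(
    requesting_cell="cell-x",
    interpretability_cell=best.name,
    checkpoint_frequency="every checkpoint",
    trace_coverage="full",
    disagreement_protocol="escalate to governance",
    halt_call_authority=best.name,
    start=date(2025, 8, 31),
    duration_months=12,
)
print(record.interpretability_cell, record.next_review())
```

Keeping the record immutable (`frozen=True`) mirrors the idea that a signed record should not be edited in place; a change of scope would be a new record.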
Over the protocol's history, paired interpretability cells have called five halts. Three of those halts stopped RSI runs (see RSI axis). The other two stopped agentic-axis sandboxed self-modification experiments that produced traces the paired interpretability cells could not satisfactorily reconstruct. In all five cases, the paired cell's halt was honoured without override.
Where this connects
Everywhere. That is the axis's job.
Active cells under this axis
Publications under this axis
conf
Mechanistic Markers of Planning Depth in Language-Model Agents
conf
Cross-Cell Replication of the 700-Circuit Conjecture
conf
Steering Vectors as a Lightweight Alternative to Activation Patching
preprint
Scalable Oversight for Multi-Step Agent Systems: a Debate-Plus-Trace Approach
conf
Latent Goal Decoding via Sparse Probes
preprint
Mechanistic Circuit Analysis at Frontier Scale: cells as a unit of interpretability
internal
Toward Formal Verification of Learned Policies in Bounded Environments
conf
Latent Trajectory Surgery: Editing Agent Plans in Mid-Run
conf
Federated Trace Auditing Without Centralizing Logs
internal
Interpretability Cell Pairing: how every dual-use capability run gets a watchful sibling
preprint