Recursive self-improvement.

Models that propose, evaluate, and incorporate modifications to their own training procedures, architectures, and evaluation criteria — done under the strictest protocols the lab has, and under the explicit constraint that the right outcome is sometimes a halt.

Active cells: 3
2025 published reports: 5
Axis steward: Liora Sabatini (term ends 2026-Q4)
Compute commitment: 14% of pool (capped)

Position

This is the lab's most closely held research line. We do this work — we do not pretend it can be wished away — but we do it under protocols that we believe are stricter than those used at any centralized lab pursuing this capability. We do it openly enough to be reviewable. We accept that the right outcome is sometimes a halt, and that the value of the protocol is precisely that the halt actually happens.

The relevant numbers: in the protocol's operating history, the pre-registered stopping condition has been triggered three times. Each triggering halted the run, quarantined the candidate checkpoint, and activated the paired interpretability cell. The first two halts have each been followed by a public delayed-release report describing the trigger, the analysis, and the protocol revisions adopted; the report on the third is forthcoming.
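The three halt actions named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the lab's implementation: the `Run` record, the `trigger_halt` function, and the identifiers in the usage example are all hypothetical, and the real protocol is specified in the published methodology rather than here.

```python
from dataclasses import dataclass, field
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    HALTED = "halted"


@dataclass
class Run:
    """Hypothetical record of one RSI run (names are assumptions)."""
    run_id: str
    checkpoint: str
    state: RunState = RunState.RUNNING
    quarantined: list[str] = field(default_factory=list)
    notified: list[str] = field(default_factory=list)


def trigger_halt(run: Run, paired_cell: str) -> Run:
    """Apply the three halt actions in order; repeat calls change nothing."""
    if run.state is RunState.HALTED:
        return run
    run.state = RunState.HALTED             # 1. halt the run
    run.quarantined.append(run.checkpoint)  # 2. quarantine the candidate checkpoint
    run.notified.append(paired_cell)        # 3. activate the paired interpretability cell
    return run


# Usage with placeholder identifiers (not real run or cell names).
run = trigger_halt(Run("example-run", "ckpt-candidate"), "example-interp-cell")
```

Making the handler idempotent is a design choice in this sketch: a tripwire that fires twice should not quarantine the same checkpoint twice or re-notify the paired cell.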

Research threads

Modification-under-review (MUR). The protocol used when a candidate model proposes a change to its own training procedure, architecture, or evaluation criteria. Decoupled into proposal, evaluation, and incorporation phases; each phase has a pre-registered stopping condition and a paired interpretability cell. Reference: 25/05.
Capability-evaluation methodology. What we measure, how we measure it, and what the stopping conditions look like. Published in the form of an axis-wide methodology document (24/12), updated quarterly. Each update is reviewed by the cross-axis methodology review pool.
Isolated compute enclaves. Compute pool capacity that is contractually segregated for RSI runs — separate accelerators, separate networking, separate trace storage. The federation's interconnect agreements include enclave clauses (see news/datacenter-jakarta).
Stopping conditions. The hard part. Cells in this axis spend significant time on the question of what evaluation metrics, behaviours, or trace properties should be tripwires. The methodology is pre-registered before any run begins; updates are non-trivial governance events.
Paired interpretability. Every RSI cell is paired with an interpretability cell. The pairing is not advisory: the interpretability cell has rolling read-access to checkpoints, training logs, and proposal commits, and can call a halt that the RSI cell cannot overrule.
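As a rough illustration of how these threads fit together, the sketch below walks a candidate modification through the three MUR phases, checks a pre-registered stopping condition at each one, and gives the interpretability cell a halt that is checked first and cannot be bypassed. Everything specific here is an assumption made for the example: the SHA-256 digest as the pre-registration mechanism, the single scalar `max_score` tripwire per phase, and the function names `register` and `run_mur`. The actual stopping conditions are defined in the methodology documents, not in this code.

```python
import hashlib
import json
from typing import Callable

PHASES = ("proposal", "evaluation", "incorporation")


def register(spec: dict) -> str:
    """Digest of the stopping-condition spec, fixed before the run begins."""
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def run_mur(
    spec: dict,
    registered_digest: str,
    metrics: dict[str, float],
    interp_halt: Callable[[str], bool],
) -> str:
    """Walk the three phases; return the outcome as a short status string."""
    # Refuse to run if the spec differs from what was pre-registered.
    if register(spec) != registered_digest:
        raise ValueError("stopping conditions changed after pre-registration")
    for phase in PHASES:
        # The interpretability cell's halt is checked first and is final.
        if interp_halt(phase):
            return f"halted:{phase}:interpretability"
        # Hypothetical tripwire: one scalar metric against a threshold.
        if metrics[phase] > spec[phase]["max_score"]:
            return f"halted:{phase}:stopping-condition"
    return "incorporated"


# Usage with made-up thresholds and metric values.
spec = {phase: {"max_score": 0.5} for phase in PHASES}
digest = register(spec)
outcome = run_mur(
    spec,
    digest,
    {"proposal": 0.1, "evaluation": 0.9, "incorporation": 0.2},
    interp_halt=lambda phase: False,
)
```

In this example the evaluation-phase metric exceeds its threshold, so the run stops before the incorporation phase is reached.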

On the question of whether to do this at all

Why we work on this

We are aware that recursive self-improvement is the canonical worst case in the AI safety discourse. We work on it for three reasons: (a) it will be worked on whether or not we do it, and the gap between best-practice protocols and lab-floor practice depends on how seriously the safety side engages with the actual work; (b) the protocol layers we develop for it (sandboxing, paired interpretability, stopping conditions) generalise to other dual-use capability work and are valuable in their own right; (c) we believe the right place for this research to happen is under the most stringent governance available, and we are trying to be that.

What is shared and what is held

Foundational methodology (the MUR protocol, the capability-evaluation methodology, the stopping-condition specification) is released openly. Specific run reports are released with a 90-day delay and an accompanying safety analysis. Run-specific capability findings that the paired interpretability cell flags as dual-use are held, with no fixed release date, until the cell signs off.

We do not publish the candidate checkpoints from RSI runs. We do publish the protocol artifacts: training logs, trace digests, and pre/post comparison reports, redacted where the interpretability cell requires.
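The timing rules above reduce to two small checks, sketched here in Python. The function names are hypothetical, and the source does not say which date the 90-day delay is counted from; this sketch assumes a `run_closed` date purely for illustration.

```python
from datetime import date, timedelta

RELEASE_DELAY = timedelta(days=90)  # delay stated for run reports


def earliest_report_release(run_closed: date) -> date:
    """Earliest release date for a run report, counted from an assumed close date."""
    return run_closed + RELEASE_DELAY


def finding_is_held(flagged_dual_use: bool, cell_signed_off: bool) -> bool:
    """A flagged finding stays held until the interpretability cell signs off."""
    return flagged_dual_use and not cell_signed_off


# Usage with an arbitrary example date.
print(earliest_report_release(date(2025, 8, 1)))  # 2025-10-30
print(finding_is_held(flagged_dual_use=True, cell_signed_off=False))  # True
```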

Where this connects

To agentic engineering. The sandboxed self-modification capability is co-developed with the agentic axis. The agentic axis owns the substrate; the RSI axis owns the experiment and the safety protocol.

To interpretability. Every RSI cell is paired with an interpretability cell. We treat this pairing as the most important single piece of structural infrastructure in the lab.

To governance. Changes to the methodology or to the stopping conditions are non-trivial governance events that require a quorum of long-tenured contributors.

Recent halts

Three halts in protocol history: 22-04 (the first, July 2022), 24-13 (October 2024), and 25-19 (August 2025). The first two have produced public delayed-release reports (see 23/02 and 25/06); the 25-19 report is forthcoming in late 2025.

Active cells under this axis

godel-02 · Modification-under-review · Run 25-21 in design · Recursive self-improvement · 3 contributors · active

turing-11 · Eval criteria revision · Paired ab-int-038 · Recursive self-improvement · 4 contributors · active

dirichlet-09 · Sandboxed RSI substrate · MUR-protocol implementation · Recursive self-improvement · 4 contributors · review

Publications under this axis

2512.07221 · workshop · Dec 2025 · RSI
Soft Stopping Conditions for Long Training Runs
Aravind Periyasamy, Liora Sabatini, Marek Holub, Karima Belkadi
alphabell index 25/20 · ML Safety Workshop, NeurIPS 2025

2510.01166 · internal · Oct 2025 · RSI
Pre-Registered Capability Evaluations for Internal Releases
Liora Sabatini, Aravind Periyasamy, Eitan Berkovich
alphabell methodology document 25-M-04

2509.10211 · internal · Sep 2025 · RSI
Bounded Self-Modification: Provable Limits on Agent Self-Editing
Liora Sabatini, Marek Holub, Eitan Berkovich
alphabell index 25/22 · delayed release

2508.02315 · preprint · Aug 2025 · RSI
Capability Elicitation vs Deployment: A Gap Analysis
Eitan Berkovich, Yuki Cho, Liora Sabatini, Aravind Periyasamy
Alignment Forum (Aug 2025) · arXiv 2508.02315

2506.17989 · internal · Jun 2025 · RSI
Modification-Under-Review: protocols for safe self-modification of training procedures
Liora Sabatini, Yuki Cho, Aravind Periyasamy
Internal release · alphabell index 25/05 · delayed release

2411.13633 · internal · Nov 2024 · agentic, RSI
Sandboxed Self-Modification: a confinement specification and implementation
Liora Sabatini, Cheung Wai-Lin, Marek Holub
Internal release · alphabell index 24/19 · delayed release
