Recursive self-improvement.
Models that propose, evaluate, and incorporate modifications to their own training procedures, architectures, and evaluation criteria — done under the strictest protocols the lab has, and under the explicit constraint that the right outcome is sometimes a halt.
Position
This is the lab's most closely held research line. We do this work — we do not pretend it can be wished away — but we do it under protocols that we believe are stricter than those used at any centralized lab pursuing this capability. We do it openly enough to be reviewable. We accept that the right outcome is sometimes a halt, and that the value of the protocol is precisely that the halt actually happens.
The relevant numbers: in the protocol's operating history, the pre-registered stopping condition has been triggered three times. Each triggering halted the run, quarantined the candidate checkpoint, and activated the paired interpretability cell. Each halt has been followed by a public delayed-release report describing the trigger, the analysis, and the protocol revisions adopted.
Research threads
- Modification-under-review (MUR). The protocol used when a candidate model proposes a change to its own training procedure, architecture, or evaluation criteria. Decoupled into proposal, evaluation, and incorporation phases; each phase has a pre-registered stopping condition and a paired interpretability cell. Reference: 25/05.
- Capability-evaluation methodology. What we measure, how we measure it, and what the stopping conditions look like. Published in the form of an axis-wide methodology document (24/12), updated quarterly. Each update is reviewed by the cross-axis methodology review pool.
- Isolated compute enclaves. Compute pool capacity that is contractually segregated for recursive self-improvement (RSI) runs: separate accelerators, separate networking, separate trace storage. The federation's interconnect agreements include enclave clauses (see news/datacenter-jakarta).
- Stopping conditions. The hard part. Cells in this axis spend significant time on the question of what evaluation metrics, behaviours, or trace properties should be tripwires. The methodology is pre-registered before any run begins; updates are non-trivial governance events.
- Paired interpretability. Every RSI cell is paired with an interpretability cell. The pairing is not advisory: the interpretability cell has rolling read-access to checkpoints, training logs, and proposal commits, and can call a halt that the RSI cell cannot overrule.
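The phase structure and halt behaviour described in the threads above can be sketched in Python. This is a minimal illustration, not the lab's implementation: the names `Phase`, `StoppingCondition`, `RunResult`, and `run_mur`, the `drift` metric, and its threshold are all hypothetical. It assumes a stopping condition is a predicate over a dictionary of metrics fixed before the run begins, and that a halt raised by either the pre-registered condition or the paired interpretability cell ends the run with no override path.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class Phase(Enum):
    PROPOSAL = "proposal"
    EVALUATION = "evaluation"
    INCORPORATION = "incorporation"


@dataclass(frozen=True)
class StoppingCondition:
    """A tripwire fixed before the run begins (hypothetical shape)."""
    name: str
    triggered: Callable[[Dict[str, float]], bool]


@dataclass
class RunResult:
    halted: bool
    phase: Optional[Phase] = None
    reason: Optional[str] = None
    actions: List[str] = field(default_factory=list)


def run_mur(
    metrics_by_phase: Dict[Phase, Dict[str, float]],
    conditions: Dict[Phase, StoppingCondition],
    interp_halt: Callable[[Phase, Dict[str, float]], bool],
) -> RunResult:
    """Walk the three phases in order and halt on the first tripwire."""
    for phase in Phase:  # Enum iteration follows definition order
        metrics = metrics_by_phase[phase]
        cond = conditions[phase]
        if cond.triggered(metrics):
            reason = f"stopping condition: {cond.name}"
        elif interp_halt(phase, metrics):
            reason = "interpretability cell halt"
        else:
            continue
        # No override path: nothing in this function lets the run resume.
        actions = [
            "halt run",
            "quarantine candidate checkpoint",
            "activate paired interpretability cell",
        ]
        return RunResult(True, phase, reason, actions)
    return RunResult(halted=False)


# Hypothetical usage: one tripwire per phase on an invented "drift" metric.
conditions = {
    p: StoppingCondition(f"{p.value}-drift", lambda m: m["drift"] > 0.5)
    for p in Phase
}
metrics = {
    Phase.PROPOSAL: {"drift": 0.1},
    Phase.EVALUATION: {"drift": 0.9},
    Phase.INCORPORATION: {"drift": 0.0},
}
result = run_mur(metrics, conditions, interp_halt=lambda phase, m: False)
```

In this sketch the interpretability cell's veto is just a callback the run cannot bypass; a real deployment would presumably enforce it outside the RSI cell's own process.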
On the question of whether to do this at all
We are aware that recursive self-improvement is the canonical worst case in the AI safety discourse. We work on it because (a) it will be worked on whether or not we do it, and the gap between best-practice protocols and lab-floor practice is a function of how seriously the safety side engages with the actual work; (b) the protocol layers we develop for it (sandboxing, paired interpretability, stopping conditions) generalise to other dual-use capability work and are valuable in their own right; and (c) we believe the right place for this research to happen is under the most stringent governance available, and we are trying to be that.
What is shared and what is held
Foundational methodology (the MUR protocol, the capability-evaluation methodology, the stopping-condition specification) is released openly. Specific run reports are released with a 90-day delay and an accompanying safety analysis. Run-specific capability findings that the paired interpretability cell flags as dual-use are held, with no fixed release date, until the cell signs off.
We do not publish the candidate checkpoints from RSI runs. We do publish the protocol artifacts: training logs, trace digests, and pre/post comparison reports, redacted where the interpretability cell requires.
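As an illustration of the release rules in this section, the following sketch computes the earliest release date for run-specific material. The 90-day delay and the dual-use hold come from the text; the function name, its parameters, and the assumption that the delay is counted from the date the run ended are hypothetical, since the text does not say which event starts the clock or whether the delay still applies after sign-off.

```python
from datetime import date, timedelta
from typing import Optional

RELEASE_DELAY = timedelta(days=90)  # delay stated in the text


def earliest_release(
    run_ended: date,
    flagged_dual_use: bool = False,
    interp_signed_off: bool = False,
) -> Optional[date]:
    """Return the earliest release date, or None while material is held.

    Assumption: the 90-day clock starts when the run ends, and the same
    delay applies once a dual-use hold is lifted.
    """
    if flagged_dual_use and not interp_signed_off:
        return None  # held until the interpretability cell signs off
    return run_ended + RELEASE_DELAY
```

For example, under these assumptions `earliest_release(date(2025, 8, 1))` returns `date(2025, 10, 30)`, while the same call with `flagged_dual_use=True` returns `None` until `interp_signed_off=True` is also passed.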
Where this connects
To agentic engineering. The sandboxed self-modification capability is co-developed with the agentic axis. The agentic axis owns the substrate; the RSI axis owns the experiment and the safety protocol.
To interpretability. Every RSI cell is paired with an interpretability cell. We treat this pairing as the most important single piece of structural infrastructure in the lab.
To governance. Changes to the methodology or to the stopping conditions are non-trivial governance events that require a long-tenured-contributor quorum.
Recent halts
Three halts in protocol history: 22-04 (the first, July 2022), 24-13 (October 2024), and 25-19 (August 2025). Each halt has produced a public delayed-release report: see 23/02, 25/06, and the 25-19 report, forthcoming in late 2025.
Active cells under this axis
Publications under this axis
workshop
internal
Pre-Registered Capability Evaluations for Internal Releases
internal
Bounded Self-Modification: Provable Limits on Agent Self-Editing
preprint
Capability Elicitation vs Deployment: A Gap Analysis
internal
Modification-Under-Review: protocols for safe self-modification of training procedures
internal