Sandboxed Self-Modification: a confinement specification and implementation

Liora Sabatini, Cheung Wai-Lin, Marek Holub

Axis Agentic engineering
Cell godel-02
Published Nov 2024
Venue Internal release — alphabell index 24/19 · delayed release
Tags agentic, RSI

Abstract

Self-modification of an agent's code, tool catalogue, or training procedure is the most consequential operation we permit agents to perform. We specify a confinement profile under which such operations may proceed, including jurisdictional segregation of artefacts, mandatory dual-cell sign-off, and rolling read-access for paired interpretability cells, and we describe an implementation in the alphabell substrate. The specification is informed by three internal incidents, which are redacted in the public release; full incident reports are available to long-tenured contributors and external audit partners.
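The three controls named in the abstract can be pictured as a pre-flight check on a modification request. What follows is a minimal sketch and not the paper's implementation: the class `ModificationRequest`, the functions `violations` and `may_proceed`, the field names, and the reading of each control (all artefacts tagged with the request's declared jurisdiction, sign-off from at least two distinct cells, read access granted to the paired interpretability cell) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical target names, mirroring the three operation types in the abstract.
TARGETS = {"code", "tool_catalogue", "training_procedure"}


@dataclass(frozen=True)
class ModificationRequest:
    """Hypothetical request record; none of these fields come from the paper."""
    target: str                    # what the agent wants to modify
    jurisdiction: str              # jurisdiction declared for this request
    artefact_jurisdictions: tuple  # jurisdiction tag of each artefact touched
    signoff_cells: frozenset       # names of the cells that signed off
    interp_cell_can_read: bool     # paired interpretability cell has read access


def violations(req):
    """Return the confinement rules the request breaks (empty list if none)."""
    found = []
    if req.target not in TARGETS:
        found.append("unknown target")
    # Assumed reading of "jurisdictional segregation": no artefact may carry
    # a jurisdiction tag other than the one declared for the request.
    if any(j != req.jurisdiction for j in req.artefact_jurisdictions):
        found.append("artefact outside declared jurisdiction")
    # Assumed reading of "dual-cell sign-off": two distinct cells must approve.
    if len(req.signoff_cells) < 2:
        found.append("fewer than two cells signed off")
    # Assumed reading of "rolling read-access": access must be in place now.
    if not req.interp_cell_can_read:
        found.append("interpretability cell lacks read access")
    return found


def may_proceed(req):
    """A request may proceed only when it breaks no rule."""
    return not violations(req)
```

Returning the full list of violations, instead of stopping at the first failed rule, is a design choice of this sketch: it lets a reviewer see every blocking condition in one pass.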

Index metadata

Cell godel-02
Compute redacted
Status Delayed release (180-day delay); classified appendices not released
Companion Internal incident logs ab-inc-19-{a,b,c}
DOI 10.48550/arXiv.2411.13633
arXiv arXiv:2411.13633

What this paper is part of

This index entry is part of the Agentic engineering research axis. The producing cell, godel-02, collaborates with adjacent cells listed in the cell directory. The paired interpretability cell (where applicable) is identified in the metadata above; its disagreement reports, if any, accompany the public release.

How to read this

If you want to use the result: the code (where available) is at https://github.com/alphabell-labs/ab-sandboxe; no dataset has been released yet, and its location is still TBD. To cite this report, prefer the DOI/arXiv identifier and the entry in the Citation section. To discuss this with the producing cell, contact the lab and quote the index entry slug sandboxed-self-modification.

Limitations

Each cell-published report carries an explicit limitations section in the internal index. We do not paraphrase it here. Read the linked PDF — particularly its limitations and threats-to-validity sections — before downstream use.

Citation

Liora Sabatini, Cheung Wai-Lin, Marek Holub. Sandboxed Self-Modification: a confinement specification and implementation. Internal release — alphabell index 24/19 · delayed release, Nov 2024. arXiv:2411.13633. doi:10.48550/arXiv.2411.13633.