Reward Hacking Detection Through Trace-Level Anomaly Models

Maral Lotfi, Hiroshi Tanigawa, Olu Folarin

Axis Interpretability & alignment

Cell lebesgue-22

Published Sep 2024

Venue arXiv 2409.02118 · Alignment Forum (Sep 2024)

Tags interp.

BibTeX

@misc{lotfi2024reward,
  title        = {Reward Hacking Detection Through Trace-Level Anomaly Models},
  author       = {Lotfi, Maral and Tanigawa, Hiroshi and Folarin, Olu},
  year         = {2024},
  howpublished = {arXiv 2409.02118 · Alignment Forum (Sep 2024)},
  month        = {sep},
  doi          = {10.48550/arXiv.2409.02118},
  url          = {https://dev.alphabell.com/publications/reward-hacking-trace-anomaly}
}

Abstract

We train an unsupervised anomaly detector over execution traces emitted by substrate-hosted agents and show that the detector flags 78% of held-out reward-hacking attempts at a 4% false-positive rate. Unlike eval-time tripwires that compare scalar rewards against expectations, the trace-level detector recognises structural deviations in the tool-call sequence, the resource-consumption profile, and the trace-tree shape. We propose this as a standard component of paired-interpretability monitoring.
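To make the headline metric concrete, here is a minimal sketch of how a detection rate at a fixed false-positive rate can be computed from trace-level anomaly scores. It is not the paper's detector. The `Span` schema, the five summary features, and the max-z-score scoring rule are all illustrative assumptions standing in for whatever model the authors trained; only the three signal families (tool calls, resource consumption, trace-tree shape) come from the abstract.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
import math


@dataclass
class Span:
    """One node of a trace tree: a tool call and its child calls (hypothetical schema)."""
    tool: str
    cpu_ms: float = 0.0
    children: list["Span"] = field(default_factory=list)


def features(root: Span) -> list[float]:
    """Summarise a trace as [calls, distinct tools, depth, max branching, total cpu_ms]."""
    count, depth, branching, cpu = 0, 0, 0, 0.0
    tools = set()
    stack = [(root, 1)]  # iterative walk, so deep trees cannot hit the recursion limit
    while stack:
        node, d = stack.pop()
        count += 1
        tools.add(node.tool)
        depth = max(depth, d)
        branching = max(branching, len(node.children))
        cpu += node.cpu_ms
        stack.extend((child, d + 1) for child in node.children)
    return [float(count), float(len(tools)), float(depth), float(branching), cpu]


def fit(benign: list[list[float]]) -> list[tuple[float, float]]:
    """Per-feature (mean, std) over benign traces only; a zero std is replaced by 1.0."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*benign)]


def score(x: list[float], stats: list[tuple[float, float]]) -> float:
    """Anomaly score (assumed rule): largest absolute z-score across features."""
    return max(abs(v - m) / s for v, (m, s) in zip(x, stats))


def threshold_at_fpr(benign_scores: list[float], fpr: float) -> float:
    """Threshold t such that at most a fraction `fpr` of benign scores satisfy score > t."""
    ranked = sorted(benign_scores, reverse=True)
    k = min(math.floor(fpr * len(ranked)), len(ranked) - 1)  # benign traces allowed to fire
    return ranked[k]


def detection_rate(attack_scores: list[float], threshold: float) -> float:
    """Fraction of known reward-hacking traces whose score exceeds the threshold."""
    return sum(s > threshold for s in attack_scores) / len(attack_scores)
```

Under these assumptions, one would call `fit` on benign traces, calibrate with `threshold_at_fpr(held_out_benign_scores, 0.04)`, and then report `detection_rate` on labelled reward-hacking traces. Calibrating on benign traces that were not used for fitting keeps the false-positive estimate honest.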

Index metadata

Cell: lebesgue-22
Compute: 19 H100-days
Status: Open release
Code: github.com/alphabell-labs/ab-anomaly
DOI: 10.48550/arXiv.2409.02118
arXiv: 2409.02118

What this paper is part of

This index entry is part of the Interpretability & alignment research axis. The producing cell, lebesgue-22, collaborates with adjacent cells listed in the cell directory. Where a paired interpretability cell exists, it is identified in the metadata above, and its disagreement reports, if any, accompany the public release.

How to read this

If you want to use the result: the code is at https://github.com/alphabell-labs/ab-anomaly; no dataset has been released yet, and its location is still TBD. To cite this report, use the DOI or arXiv identifier and the BibTeX block above. To discuss it with the producing cell, contact the lab and quote the index entry slug reward-hacking-trace-anomaly.

Limitations

Each cell-published report carries an explicit limitations section in the internal index. We do not paraphrase it here. Read the linked PDF — particularly its limitations and threats-to-validity sections — before downstream use.

Citation

Maral Lotfi, Hiroshi Tanigawa, Olu Folarin. Reward Hacking Detection Through Trace-Level Anomaly Models. arXiv:2409.02118 · Alignment Forum, Sep 2024. doi:10.48550/arXiv.2409.02118.