Assessing Chain-of-Thought Monitorability in AI: A Critical View on Internal Reasoning Control
OpenAI introduced a framework to evaluate chain-of-thought (CoT) monitorability : whether a monitor can predict properties of an AI system’s behavior by analyzing observable signals such as the model’s chain-of-thought, rather than relying only on final answers and tool actions. The motivation is practical. As reasoning models become better at long-horizon tasks, tool use, and strategic problem solving, it becomes harder to supervise them with direct human review alone. OpenAI’s work focuses on how well we can measure monitorability across tasks and settings, and how that monitorability changes with more reasoning at inference time , reinforcement learning (RL) , and pretraining scale . TL;DR OpenAI defines monitorability as the ability of a monitor to predict properties of interest about an agent’s behavior. OpenAI introduces 13 evaluations across 24 environments , grouped into three archetypes: intervention , process , and outcome-property . OpenAI ...