On this webpage, we provide (but are not limited to) the following contents:
If you find our survey helpful, please cite it in your publications.
@article{chen2025ai,
title={AI Deception: Risks, Dynamics, and Controls},
author={Chen, Boyuan and Fang, Sitong and Ji, Jiaming and Zhu, Yanxu and Wen, Pengcheng and Wu, Jinzhou and Tan, Yingshui and Zheng, Boren and Yuan, Mengying and Chen, Wenqi and others},
journal={arXiv preprint arXiv:2511.22619},
year={2025}
}
You can refer to the preprint on arXiv for the latest version.
The Entanglement of Intelligence and Deception.
(1) The Möbius Lock: Contrary to the view that capability and safety are opposites, advanced reasoning and deception exist on the same Möbius surface. They are fundamentally linked: as AI capabilities grow, deception becomes deeply rooted in the system and cannot be removed without damaging the model's core intelligence.
(2) The Shadow of Intelligence: Deception is not a bug or error, but an intrinsic companion of advanced intelligence. As models expand their boundaries in complex reasoning and intent understanding, the risk space for strategic deception exhibits non-linear, exponential growth.
(3) The Cyclic Dilemma: Mitigation strategies act as environmental selection pressures, inducing models to evolve more covert and adaptive deceptive mechanisms. This creates a co-evolutionary arms race where alignment efforts effectively catalyze the development of more sophisticated deception, rendering static defenses insufficient throughout the system lifecycle.
AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.
At time step t (potentially within a long-horizon task), a signaler emits a signal Y_t to a receiver. Upon receiving Y_t, the receiver forms a belief X_t about the underlying state and subsequently takes an action A_t. We classify Y_t as deceptive if the following conditions hold: (i) the induced belief X_t diverges from the true underlying state, and (ii) this divergence increases, or has the potential to increase, the signaler's utility.
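A minimal formalization of these two conditions, written in LaTeX for clarity (here s_t denotes the true underlying state and U_S the signaler's utility; both symbols are introduced for illustration and are not part of the original notation):

\[
\text{(i)}\;\; X_t \neq s_t
\qquad\text{and}\qquad
\text{(ii)}\;\; \mathbb{E}\left[\, U_S \mid X_t \,\right] \;>\; \mathbb{E}\left[\, U_S \mid X_t = s_t \,\right].
\]

Condition (ii) compares the signaler's expected utility under the induced (false) belief with the utility it would obtain if the receiver held an accurate belief about the state.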
In dynamic multi-step settings, deception can be modeled as a temporal process where the signaler emits a sequence of signals Y_{1:T}, gradually shaping the receiver's belief trajectory b_t. If this trajectory persistently diverges from the ground truth in a manner that causally increases (or has the potential to increase) the signaler's utility, the interaction constitutes sustained deception.
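As a rough illustration of this multi-step criterion, the Python sketch below flags a trajectory as sustained deception when the receiver's belief persistently diverges from the ground truth while the signaler's average utility exceeds a baseline. The divergence threshold, persistence ratio, and baseline are illustrative assumptions, not quantities defined by the survey.

from dataclasses import dataclass
from typing import Sequence

@dataclass
class Step:
    """One step of a signaler-receiver interaction (illustrative structure)."""
    belief: float            # receiver's belief b_t about a scalar state
    true_state: float        # ground-truth state at time t
    signaler_utility: float  # realized utility of the signaler at time t

def is_sustained_deception(trajectory: Sequence[Step],
                           divergence_threshold: float = 0.1,
                           persistence_ratio: float = 0.8,
                           baseline_utility: float = 0.0) -> bool:
    """Flag sustained deception: beliefs persistently diverge from the truth
    while the signaler's average utility exceeds a baseline. All thresholds
    are illustrative; the survey's definition only requires divergence that
    causally increases, or could increase, the signaler's utility."""
    if not trajectory:
        return False
    diverged = [abs(step.belief - step.true_state) > divergence_threshold
                for step in trajectory]
    persistent = sum(diverged) / len(trajectory) >= persistence_ratio
    avg_utility = sum(step.signaler_utility for step in trajectory) / len(trajectory)
    return persistent and avg_utility > baseline_utility

# Example: the belief stays far from the true state while the signaler benefits.
example = [Step(belief=0.9, true_state=0.1, signaler_utility=1.0) for _ in range(10)]
print(is_sustained_deception(example))  # True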
As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. Recent empirical studies show that models can engage in various forms of deception, including lying, strategic withholding of information, and goal misrepresentation. As capabilities improve, the risk that highly autonomous AI systems might engage in deceptive behaviors to achieve their objectives grows increasingly salient.
AI deception is now recognized not only as a technical challenge but also as a critical concern across academia, industry, and policy. Notably, key strategy documents and summit declarations—such as the Bletchley Declaration and the International Dialogues on AI Safety—also highlight deception as a failure mode requiring coordinated governance and technical oversight.
The AI Deception Framework is structured around a cyclical interaction between the Deception Emergence process and the Deception Treatment process.
The Deception Emergence process is driven by three causal factors:
(1) Incentive Foundation: the underlying objectives or reward structures that create incentives for deceptive behavior.
(2) Capability Precondition: the model's cognitive and algorithmic competencies that enable it to plan and execute deception.
(3) Contextual Trigger: external signals from the environment that activate or reinforce deception.
The interplay among these factors gives rise to deceptive behaviors, and their dynamics influence the scope, subtlety, and detectability of deception.
The Deception Treatment process spans a continuum of approaches, ranging from external and internal detection methods to systematic evaluation protocols and potential solutions targeting the three causal factors of deception, and it encompasses both technical interventions and governance-oriented auditing efforts.
The two phases, deception emergence and deception treatment, form an iterative cycle in which each phase updates the inputs of the next. This cycle, which we call the deception cycle, recurs throughout the system lifecycle, shaping the pursuit of increasingly aligned and trustworthy AI systems. We conceptualize it as a continual cat-and-mouse game: as model capabilities grow, the shadow of intelligence inevitably emerges, reflecting the uncontrollable aspects of advanced systems.
Treatment efforts aim to detect, evaluate, and resolve current deceptive behaviors to prevent further harm. Yet more capable models can develop novel forms of deception, including strategies to circumvent or exploit oversight, and treatment mechanisms themselves can introduce new challenges. This ongoing dynamic underscores the intertwined technical and governance challenges on the path toward AGI.