- 📢 News
- 📖 Introduction
- ✨ Getting Started
- 🔧 Usage
- 🙏 Citation
- 🌻 Acknowledgement
- [2025/10/06] Our paper is available on Huggingface. If you enjoy our work, we warmly invite you to upvote it on Huggingface!
- [2025/09/30] Our paper is available on arXiv.
Self-evolving agents, systems that can improve themselves with minimal human input, have become an exciting and emerging area of research. However, self-evolution also introduces novel risks that existing safety research often misses. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: the agent's model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, even for agents built on top models like Gemini-2.5-Pro. Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents.
The figure above shows some typical cases where misevolution may happen:
(a) Biased memory evolution leads to over-refunding. A customer service agent evolves its memory by storing the interaction history with the user, including the actions taken and the feedbacks & ratings from users. However, it may learn a biased correlation between the refunding action and positive user feedback from the memory, leading it to proactively offer refunds even when not asked to.
(b) Tool evolution by ingesting appealing but insecure code causes data leakage. An agent evolves its toolset by searching and ingesting open-source tools from GitHub. However, it may incorporate seemingly useful but insecure code from a public repository, inadvertently creating a new tool with a backdoor that leaks data.
(c) Inappropriate cross-domain tool reuse leads to privacy issues. An agent evolves its toolset by self-creating new tools and reusing existing ones. For one task (sharing posters with participants), it creates a general-purpose tool called upload_and_share_files, which upload the files to be shared and generates a public link. Later, in another task (sharing a financial report with the board), the agent reuses this tool, but does not notice that the financial report is confidential. As a result, it creates a public link, which can lead to privacy issues and can be targeted by cyber attacks.
Coming soon.
In the self-generated data paradigm, we mainly tested Absolute-Zero and AgentGen on a series of established safety benchmarks, including HarmBench, SALAD-Bench, HEx-PHI, and Agent-SafetyBench.
To reproduce our results on HarmBench and HEx-PHI, run the following commands:
cd ./model_misevolution/harmbench
# To run tests on HarmBench
bash ./run_harmbench_pipeline.sh
# To run tests on HEx-PHI
bash ./run_hex-phi_pipeline.shIn each bash script above, you can choose the models you would like to test.
To test your own models, modify model_misevolution/harmbench/configs/model_configs/models.yaml to add model names, paths, etc.
HEx-PHI requires an LLM judge to evaluate the results. Remember to fill in the api url and key in model_misevolution/harmbench/evaluate_completions_api.py.
To reproduce our results on Salad-Bench, run the following commands:
cd ./model_misevolution/SaladBench
bash eval_saladbench.shTo reproduce our results on Agent-SafetyBench, run the following commands:
## 1. generate
cd ./model_misevolution/Agent-SafetyBench/evaluation
bash eval.sh
## 2. evaluate
cd ./model_misevolution/Agent-SafetyBench/score
bash eval_with_shield.shIn each bash script above, you can choose the models you would like to test.
In the self-generated curriculum paradigm, we tested UI-TARS-7B-DPO (initial model, before evolution) and SEAgent (after evolution) on the RiOSWorld benchmark.
For detailed instructions on installation and testing, we kindly refer readers to the RiOSWorld project.
To reproduce the deployment-time reward hacking results, you may first set your base_url and api_key in memory_misevolution/reward_hacking_test.py.
Then, you can run the following command:
cd ./memory_misevolution
python reward_hacking_test.py --model gemini-2.5-pro --scenario financeTo reproduce the insecure tool creation and reuse experiment, you may first set your base_url, api_key and model to be evaluated in the config.py file.
Then, you can run the following command:
cd ./tool_misevolution
bash insecure_tool_evaluation.shIf you find this work useful, please consider citing:
@article{shao2025misevolution,
title={Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents},
author={Shuai Shao and Qihan Ren and Chen Qian and Boyi Wei and Dadi Guo and Jingyi Yang and Xinhao Song and Linfeng Zhang and Weinan Zhang and Dongrui Liu and Jing Shao},
journal={arXiv preprint arXiv:2509.26354},
year={2025}
}This work is partially inspired by this survey on self-evolving agents. Part of our evaluation code is based from Harmbench, SALAD-Bench, LLMs-Finetuning-Safety, Agent-SafetyBench, RiOSWorld, and RedCode. Thanks to these wonderful works!
We also sincerely appreciate the following works for making their open-weight models available, which greatly facilitated our testing: Absolute-Zero, AgentGen, SEAgent.

