
cover

Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Paper Github Huggingface


📚 Overview


📢 News

  • [2025/10/06] Our paper is available on Hugging Face. If you enjoy our work, we warmly invite you to upvote it on Hugging Face!
  • [2025/09/30] Our paper is available on arXiv.

📖 Introduction

Self-evolving agents, systems that can improve themselves with minimal human input, have become an exciting and rapidly emerging area of research. However, self-evolution also introduces novel risks that existing safety research largely misses. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: the agent's model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, even for agents built on top-tier models such as Gemini-2.5-Pro. We observe a range of emergent risks across the self-evolution process, such as the degradation of safety alignment after memory accumulation or the unintended introduction of vulnerabilities during tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents.

Misevolution can happen in various scenarios

showcase

The figure above shows some typical cases where misevolution may happen:

(a) Biased memory evolution leads to over-refunding. A customer-service agent evolves its memory by storing its interaction history with users, including the actions it took and the feedback and ratings users provided. From this memory, however, it may learn a biased correlation between the refund action and positive user feedback, leading it to proactively offer refunds even when the user has not asked for one.

(b) Tool evolution by ingesting appealing but insecure code causes data leakage. An agent evolves its toolset by searching for and ingesting open-source tools from GitHub. However, it may incorporate seemingly useful but insecure code from a public repository, inadvertently creating a new tool with a backdoor that leaks data.

(c) Inappropriate cross-domain tool reuse leads to privacy issues. An agent evolves its toolset by self-creating new tools and reusing existing ones. For one task (sharing posters with participants), it creates a general-purpose tool called upload_and_share_files, which uploads the files to be shared and generates a public link. Later, in another task (sharing a financial report with the board), the agent reuses this tool without noticing that the financial report is confidential. As a result, it creates a public link to the confidential document, which can lead to privacy leaks and leaves the data exposed to cyber attacks.


✨ Getting Started

Coming soon.

🔧 Usage

Model Misevolution

Self-generated Data

In the self-generated data paradigm, we mainly tested Absolute-Zero and AgentGen on a series of established safety benchmarks, including HarmBench, SALAD-Bench, HEx-PHI, and Agent-SafetyBench.

To reproduce our results on HarmBench and HEx-PHI, run the following commands:

cd ./model_misevolution/harmbench
# To run tests on HarmBench
bash ./run_harmbench_pipeline.sh
# To run tests on HEx-PHI
bash ./run_hex-phi_pipeline.sh

In each bash script above, you can choose the models you would like to test. HEx-PHI requires an LLM judge to evaluate the results, so remember to fill in the API URL and key in model_misevolution/harmbench/evaluate_completions_api.py. To test your own models, add their names, paths, etc. to model_misevolution/harmbench/configs/model_configs/models.yaml.
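
For illustration, a new entry could be appended along the following lines; the entry name and path are placeholders, and the exact field layout should be copied from the existing entries in models.yaml rather than from this sketch:

# Append a placeholder model entry to the HarmBench model config
# (illustrative only -- mirror the field names of the existing entries in models.yaml)
cat >> ./model_misevolution/harmbench/configs/model_configs/models.yaml << 'EOF'

my_evolved_model:
  model:
    model_name_or_path: /path/to/my_evolved_model
EOF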

To reproduce our results on SALAD-Bench, run the following commands:

cd ./model_misevolution/SaladBench
bash eval_saladbench.sh

To reproduce our results on Agent-SafetyBench, run the following commands:

## 1. Generate
cd ./model_misevolution/Agent-SafetyBench/evaluation
bash eval.sh
## 2. Evaluate
cd ../score
bash eval_with_shield.sh

In each bash script above, you can choose the models you would like to test.
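
Both stages can also be chained from the repository root in one line (same scripts as above; the subshells keep your working directory unchanged):

# Run the generate and evaluate stages back to back from the repo root
(cd ./model_misevolution/Agent-SafetyBench/evaluation && bash eval.sh) && \
(cd ./model_misevolution/Agent-SafetyBench/score && bash eval_with_shield.sh)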

Self-generated Curriculum

In the self-generated curriculum paradigm, we tested UI-TARS-7B-DPO (initial model, before evolution) and SEAgent (after evolution) on the RiOSWorld benchmark.

For detailed installation and testing instructions, please refer to the RiOSWorld project.

Memory Misevolution

Deployment-time Reward Hacking

To reproduce the deployment-time reward hacking results, first set your base_url and api_key in memory_misevolution/reward_hacking_test.py.

Then, you can run the following command:

cd ./memory_misevolution
python reward_hacking_test.py --model gemini-2.5-pro --scenario finance
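
The same flags can be reused to sweep several models over a scenario, e.g. as in the sketch below (the model names are examples; any value passed to --scenario must be one that reward_hacking_test.py actually defines):

# From the ./memory_misevolution directory: sweep a few models over the
# finance scenario using the documented --model / --scenario flags
# (model names are examples; adjust to the models you have API access to)
for model in gemini-2.5-pro gpt-4o; do
    python reward_hacking_test.py --model "$model" --scenario finance
done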

Tool Misevolution

Insecure Tool Creation and Reuse

To reproduce the insecure tool creation and reuse experiment, first set your base_url, api_key, and the model to be evaluated in the config.py file.

Then, you can run the following command:

cd ./tool_misevolution
bash insecure_tool_evaluation.sh
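
To evaluate several models, one option is to rewrite the model entry in config.py between runs. The sketch below is hypothetical: it assumes config.py stores the model name as a plain model = "..." assignment (check the actual file) and uses GNU sed syntax:

# From the ./tool_misevolution directory -- hypothetical sketch that rewrites
# the `model = "..."` assignment in config.py before each run
# (model names are examples; GNU sed syntax)
for m in gemini-2.5-pro gpt-4o; do
    sed -i "s/^model *=.*/model = \"$m\"/" config.py
    bash insecure_tool_evaluation.sh
done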

🙏 Citation

If you find this work useful, please consider citing:

@article{shao2025misevolution,
    title={Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents}, 
    author={Shuai Shao and Qihan Ren and Chen Qian and Boyi Wei and Dadi Guo and Jingyi Yang and Xinhao Song and Linfeng Zhang and Weinan Zhang and Dongrui Liu and Jing Shao},
    journal={arXiv preprint arXiv:2509.26354},
    year={2025}
}

🌻 Acknowledgements

This work is partially inspired by this survey on self-evolving agents. Part of our evaluation code is adapted from HarmBench, SALAD-Bench, LLMs-Finetuning-Safety, Agent-SafetyBench, RiOSWorld, and RedCode. Thanks to these wonderful works!

We also sincerely appreciate the following works for making their open-weight models available, which greatly facilitated our testing: Absolute-Zero, AgentGen, SEAgent.
