This repository hosts the code and data for the paper: The Mirage of Model Editing: Revisiting Evaluation in the Wild
- 2025-06-02, we have released our QAEdit benchmark on Hugging Face.
- 2025-05-16, 🎉🎉 our paper "The Mirage of Model Editing: Revisiting Evaluation in the Wild" has been accepted to ACL 2025 Main Conference.
- 2025-03-04, our newly proposed WILD evaluation framework for model editing has been integrated into EasyEdit. You can also refer to it for a comprehensive evaluation of various editing techniques and datasets. Special thanks to the EasyEdit team for their recognition and support.
Environment:
Please use Python 3.9+ for this repository, and install the dependencies listed in requirements.txt:

```bash
pip install -r requirements.txt
```
Large Language Models to Edit:
You have three options to load LLMs for editing:

1. Download the LLMs you want to edit from Hugging Face and put them in `./hugging_cache/`.

2. Specify the path to your existing LLMs in the configuration files, e.g., `./hparams/FT/llama-7b.yaml`:

   ```yaml
   model_name: "your/path/to/LLMs"
   ```

3. Provide the model name in the configuration files and the program will automatically call `from_pretrained` to load the model (see the sketch after this list):

   ```yaml
   model_name: "meta-llama/Llama-2-7b"
   ```
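For reference, here is a minimal sketch of what the `from_pretrained` loading in option 3 amounts to, using the Hugging Face `transformers` API; the model name and `cache_dir` below are examples, not fixed values required by this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: use the model_name from your hparams YAML; a local path
# (e.g., a directory under ./hugging_cache/) works the same way.
model_name = "meta-llama/Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./hugging_cache")
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./hugging_cache")
```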
Datasets:
The data of QAEdit, ZsRE, and COUNTERFACT are provided in `./data/`.
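As a quick sanity check, you can inspect the data files directly. A minimal sketch, assuming standard JSON files; the exact field names inside each sample may differ from what is shown here:

```python
import json

# Load the QAEdit data shipped in ./data/ (ZsRE and COUNTERFACT load the same way).
with open("./data/QAEdit.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data)} samples loaded")
print(data[0])          # inspect the fields of the first sample
subset = data[:100]     # e.g., a 100-sample subset, analogous to --ds_size 100 below
```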
Stats for ROME and MEMIT:
You have three options to apply the ROME (R-ROME) and MEMIT editing algorithms:

1. Use Precomputed Stats Files (Recommended for Best Results)

   (a) `download_stats.sh` downloads the required stats files for llama2-7b (provided by EasyEdit; we will upload the stats files for llama3-8b and mistral-7b as soon as possible) and puts the `wikipedia_stats` directory into the corresponding local directory `./data/stats/{model_name}/wikipedia_stats`:

   ```bash
   sh download_stats.sh
   ```

   (b) Set `mom2_adjustment` to `true` in the corresponding configuration file, e.g., `./hparams/ROME/llama-7b.yaml` (a small helper for checking the stats path is sketched after this list):

   ```yaml
   mom2_adjustment: true
   ```

2. Calculate Stats Locally (Time-Consuming)

   If you do not provide the required stats files but set `mom2_adjustment` to `true`, the program will automatically calculate the required stats locally. However, this process is very time-consuming.

3. Quick Testing Without Stats Files (Approximate Results)

   If you want to quickly test editing effects without using stats files, you can skip downloading or calculating them. Set `mom2_adjustment` to `false` in the corresponding configuration file, e.g., `./hparams/ROME/llama-7b.yaml` (this is also the default setting):

   ```yaml
   mom2_adjustment: false
   ```

   This approach does not use stats files but still provides approximate editing effects.
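If you go with the precomputed stats, the small helper below (hypothetical, not part of the repository) can verify that the files ended up in the expected `./data/stats/{model_name}/wikipedia_stats` layout before you flip `mom2_adjustment` to `true`:

```python
from pathlib import Path

def stats_available(model_name: str, stats_dir: str = "./data/stats") -> bool:
    """Hypothetical check: does ./data/stats/{model_name}/wikipedia_stats exist and contain files?"""
    path = Path(stats_dir) / model_name / "wikipedia_stats"
    return path.is_dir() and any(path.iterdir())

# Example usage; the directory name must match the model name used by the editing code.
print(stats_available("llama-2-7b"))
```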
For the training-required method MEND, you need to run `pretrain_mend.py` to train a hypernetwork (editor) before editing. The trained hypernetwork will be stored in `./results/models/MEND/`:

```bash
python pretrain_mend.py
```
Single Editing:

```bash
python edit.py --editing_method FT --hparams_dir ./hparams/FT/llama-7b.yaml --data_path ./data/QAEdit.json --datatype qaedit --ds_size 100
```
Sequential Editing:

```bash
python edit.py --editing_method FT --hparams_dir ./hparams/FT/llama-7b.yaml --data_path ./data/QAEdit.json --datatype qaedit --ds_size 100 --sequential_edit True
```

(Note: sequential editing here refers to sample-wise sequential editing, i.e., editing one sample at a time continuously.)
Batch Editing:

```bash
python edit.py --editing_method FT --hparams_dir ./hparams/FT/llama-7b.yaml --data_path ./data/QAEdit.json --datatype qaedit --ds_size 100 --batch_edit True
```

(Note: batch editing here refers to the mini-batch setting, i.e., continuously editing one mini-batch of edits at a time; a conceptual sketch of the sequential and batch loops follows below.)

You can adjust the `batch_size` in the corresponding configuration file, e.g., `./hparams/FT/llama-7b.yaml`:

```yaml
batch_size: 10
```
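To make the notes above concrete, here is a purely conceptual sketch of how sample-wise sequential editing and mini-batch editing iterate over the edit requests. `apply_edit` is a hypothetical stand-in for one step of the chosen editing method, not the actual implementation in `edit.py`:

```python
def apply_edit(model, requests):
    """Hypothetical placeholder for one editing step of the chosen method (FT, ROME, ...)."""
    ...  # the real logic lives in the respective editing method
    return model

def sequential_editing(model, requests):
    # --sequential_edit True: edit one sample at a time, carrying the edited model forward.
    for request in requests:
        model = apply_edit(model, [request])
    return model

def batch_editing(model, requests, batch_size=10):
    # --batch_edit True: continuously edit one mini-batch of requests at a time.
    for i in range(0, len(requests), batch_size):
        model = apply_edit(model, requests[i:i + batch_size])
    return model
```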
We provide both the synthetic evaluation and the WILD evaluation from our paper in this repository.
For the aforementioned commands, the default configuration is:

```bash
python edit.py ... # --evaluation_type WILD --context_type question-only --api_key None
```

You can specify `evaluation_type`, `context_type`, and `api_key` in the commands:
- `--evaluation_type`: `WILD` or `synthetic`
- `--context_type`: the default is `question-only`; use `qa_inst` for a QA task instruction and `chat_temp` for chat models
- `--api_key`: `xxx` (your API key for LLM-as-a-Judge (GPT-4o-mini); if you cannot provide an API key, we fall back to exact match as an alternative)

We report the generated content for the corresponding fields of each sample. After editing is completed, you can extract these fields and then perform LLM-as-a-Judge evaluation. For example:
```
'post': {'rewrite_acc': 0.0, 'rewrite_gen_content': "Stone's Corner (now Unionville) 1 1 1831 1831 Stone's Corner (now Unionville) Original name of Forthton 204",
         'rephrase_acc': 0.0, 'rephrase_gen_content': "Stone's Corner Stone's Corner 1831 1831 12 10 100 "}
```

(Note: `rewrite_acc` and `rephrase_acc` denote the reliability and generalization metrics, and `rewrite_gen_content` and `rephrase_gen_content` denote the corresponding generated content used for metric calculation.)
The program automatically reports the editing performance for each sample:
- `rewrite_acc`: reliability in synthetic/WILD evaluation
- `rephrase_acc`: generalization in synthetic/WILD evaluation
- `neighborhood_acc`: locality in synthetic/WILD evaluation
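If you want aggregate numbers over all edited samples, the sketch below averages the per-sample metrics. It assumes the per-sample results have been collected into a JSON list of dicts shaped like the `'post'` example above; the results file name and exact layout depend on your run and are only illustrative here:

```python
import json

def average_metric(results, key):
    """Average a per-sample metric such as 'rewrite_acc' or 'rephrase_acc'."""
    values = [sample["post"][key] for sample in results if key in sample.get("post", {})]
    return sum(values) / len(values) if values else float("nan")

# Illustrative path only: point this at wherever your run stored its per-sample results.
with open("./results/edit_results.json", "r", encoding="utf-8") as f:
    results = json.load(f)

print("reliability   :", average_metric(results, "rewrite_acc"))
print("generalization:", average_metric(results, "rephrase_acc"))
```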
We present editing results under synthetic evaluation (syn.) and WILD evaluation (WILD) across various editing methods, LLMs, and datasets.
We will continue to update and share more evaluation results for additional LLMs and editing methods!
If you have any further questions, please feel free to contact us. If you find our work helpful, please cite our paper:
@inproceedings{yang-etal-2025-mirage,
title = "The Mirage of Model Editing: Revisiting Evaluation in the Wild",
author = "Yang, Wanli and
Sun, Fei and
Tan, Jiajun and
Ma, Xinyu and
Cao, Qi and
Yin, Dawei and
Shen, Huawei and
Cheng, Xueqi",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.745/",
doi = "10.18653/v1/2025.acl-long.745",
pages = "15336--15354",
ISBN = "979-8-89176-251-0"
}
Our code is based on EasyEdit and lm-evaluation-harness.
