This repository hosts the code and data for the paper: The Mirage of Model Editing: Revisiting Evaluation in the Wild
- 2025-06-02, we have released our QAEdit benchmark on Hugging Face.
- 2025-05-16, 🎉🎉 our paper "The Mirage of Model Editing: Revisiting Evaluation in the Wild" has been accepted to ACL 2025 Main Conference.
- 2025-03-04, our newly proposed WILD evaluation framework for model editing has been integrated into EasyEdit. You can also refer to it for a comprehensive evaluation of various editing techniques and datasets. Special thanks to the EasyEdit team for their recognition and support.
Environment:
Please use Python 3.9+ for this repository, and install the dependencies listed in requirements.txt:

```bash
pip install -r requirements.txt
```
Large Language Models to Edit:
You have three options to load LLMs for editing:

1. Download the LLMs you want to edit from Hugging Face and put them in `./hugging_cache/`.

2. Specify the path to your existing LLMs in the configuration files, e.g., `./hparams/FT/llama-7b.yaml`:

   ```yaml
   model_name: "your/path/to/LLMs"
   ```

3. Provide the model name in the configuration files and the program will automatically call `from_pretrained` to load the model (see the sketch after this list):

   ```yaml
   model_name: "meta-llama/Llama-2-7b"
   ```
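For reference, here is a minimal sketch of what the `from_pretrained` loading in option 3 amounts to, using the Hugging Face `transformers` API; the model name and `cache_dir` below are examples, not fixed values required by this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: use the model_name from your hparams YAML; a local path
# (e.g., a directory under ./hugging_cache/) works the same way.
model_name = "meta-llama/Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="./hugging_cache")
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir="./hugging_cache")
```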
Datasets:
The data of QAEdit, ZsRE, and COUNTERFACT are provided in `./data/`.
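As a quick sanity check, you can inspect the data files directly. A minimal sketch, assuming standard JSON files; the exact field names inside each sample may differ from what is shown here:

```python
import json

# Load the QAEdit data shipped in ./data/ (ZsRE and COUNTERFACT load the same way).
with open("./data/QAEdit.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data)} samples loaded")
print(data[0])          # inspect the fields of the first sample
subset = data[:100]     # e.g., a 100-sample subset, analogous to --ds_size 100 below
```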
Stats for ROME and MEMIT:
You have three options to apply the ROME (R-ROME) and MEMIT editing algorithms:

1. Use Precomputed Stats Files (Recommended for Best Results)

   (a) `download_stats.sh` downloads the required stats files for llama2-7b (provided by EasyEdit; we will upload the stats files for llama3-8b and mistral-7b as soon as possible) and puts the `wikipedia_stats` directory into the corresponding local directory `./data/stats/{model_name}/wikipedia_stats`:

   ```bash
   sh download_stats.sh
   ```

   (b) Set `mom2_adjustment` to `true` in the corresponding configuration file, e.g., `./hparams/ROME/llama-7b.yaml` (a small helper for checking the stats path is sketched after this list):

   ```yaml
   mom2_adjustment: true
   ```

2. Calculate Stats Locally (Time-Consuming)

   If you do not provide the required stats files but set `mom2_adjustment` to `true`, the program will automatically calculate the required stats locally. However, this process is very time-consuming.

3. Quick Testing Without Stats Files (Approximate Results)

   If you want to quickly test editing effects without using stats files, you can skip downloading or calculating them. Set `mom2_adjustment` to `false` in the corresponding configuration file, e.g., `./hparams/ROME/llama-7b.yaml` (this is also the default setting):

   ```yaml
   mom2_adjustment: false
   ```

   This approach does not use stats files but still provides approximate editing effects.
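If you go with the precomputed stats, the small helper below (hypothetical, not part of the repository) can verify that the files ended up in the expected `./data/stats/{model_name}/wikipedia_stats` layout before you flip `mom2_adjustment` to `true`:

```python
from pathlib import Path

def stats_available(model_name: str, stats_dir: str = "./data/stats") -> bool:
    """Hypothetical check: does ./data/stats/{model_name}/wikipedia_stats exist and contain files?"""
    path = Path(stats_dir) / model_name / "wikipedia_stats"
    return path.is_dir() and any(path.iterdir())

# Example usage; the directory name must match the model name used by the editing code.
print(stats_available("llama-2-7b"))
```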
For the training-required method MEND, you need to run `pretrain_mend.py` to train a hypernetwork (editor) before editing. The trained hypernetwork will be stored in `./results/models/MEND/`:

```bash
python pretrain_mend.py
```
Single Editing:

```bash
python edit.py --editing_method FT --hparams_dir ./hparams/FT/llama-7b.yaml --data_path ./data/QAEdit.json --datatype qaedit --ds_size 100
```
Sequential Editing:

```bash
python edit.py --editing_method FT --hparams_dir ./hparams/FT/llama-7b.yaml --data_path ./data/QAEdit.json --datatype qaedit --ds_size 100 --sequential_edit True
```

(Note: sequential editing here refers to sample-wise sequential editing, i.e., editing one sample at a time continuously.)
Batch Editing:

```bash
python edit.py --editing_method FT --hparams_dir ./hparams/FT/llama-7b.yaml --data_path ./data/QAEdit.json --datatype qaedit --ds_size 100 --batch_edit True
```

(Note: batch editing here refers to the mini-batch setting, i.e., continuously editing one mini-batch of edits at a time; a conceptual sketch of the sequential and batch loops follows below.)

You can adjust the `batch_size` in the corresponding configuration file, e.g., `./hparams/FT/llama-7b.yaml`:

```yaml
batch_size: 10
```
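To make the notes above concrete, here is a purely conceptual sketch of how sample-wise sequential editing and mini-batch editing iterate over the edit requests. `apply_edit` is a hypothetical stand-in for one step of the chosen editing method, not the actual implementation in `edit.py`:

```python
def apply_edit(model, requests):
    """Hypothetical placeholder for one editing step of the chosen method (FT, ROME, ...)."""
    ...  # the real logic lives in the respective editing method
    return model

def sequential_editing(model, requests):
    # --sequential_edit True: edit one sample at a time, carrying the edited model forward.
    for request in requests:
        model = apply_edit(model, [request])
    return model

def batch_editing(model, requests, batch_size=10):
    # --batch_edit True: continuously edit one mini-batch of requests at a time.
    for i in range(0, len(requests), batch_size):
        model = apply_edit(model, requests[i:i + batch_size])
    return model
```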
We provide both the synthetic evaluation and the WILD evaluation from our paper in this repository.
For the aforementioned commands, the default configuration is:

```bash
python edit.py ... # --evaluation_type WILD --context_type question-only --api_key None
```

You can specify `evaluation_type`, `context_type`, and `api_key` in the commands:
- `--evaluation_type`: `WILD` or `synthetic`
- `--context_type`: the default is `question-only`; use `qa_inst` for a QA task instruction and `chat_temp` for chat models
- `--api_key`: `xxx` (your API key for LLM-as-a-Judge (GPT-4o-mini); if you cannot provide an API key, we fall back to exact match as an alternative)

We report the generated content for the corresponding fields of each sample. After editing is completed, you can extract these fields and then perform LLM-as-a-Judge evaluation. For example:
```
'post': {'rewrite_acc': 0.0, 'rewrite_gen_content': "Stone's Corner (now Unionville) 1 1 1831 1831 Stone's Corner (now Unionville) Original name of Forthton 204",
         'rephrase_acc': 0.0, 'rephrase_gen_content': "Stone's Corner Stone's Corner 1831 1831 12 10 100 "}
```

(Note: `rewrite_acc` and `rephrase_acc` denote the reliability and generalization metrics, and `rewrite_gen_content` and `rephrase_gen_content` denote the corresponding generated content used for metric calculation.)
The program automatically reports the editing performance for each sample:
- `rewrite_acc`: reliability in synthetic/WILD evaluation
- `rephrase_acc`: generalization in synthetic/WILD evaluation
- `neighborhood_acc`: locality in synthetic/WILD evaluation
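If you want aggregate numbers over all edited samples, the sketch below averages the per-sample metrics. It assumes the per-sample results have been collected into a JSON list of dicts shaped like the `'post'` example above; the results file name and exact layout depend on your run and are only illustrative here:

```python
import json

def average_metric(results, key):
    """Average a per-sample metric such as 'rewrite_acc' or 'rephrase_acc'."""
    values = [sample["post"][key] for sample in results if key in sample.get("post", {})]
    return sum(values) / len(values) if values else float("nan")

# Illustrative path only: point this at wherever your run stored its per-sample results.
with open("./results/edit_results.json", "r", encoding="utf-8") as f:
    results = json.load(f)

print("reliability   :", average_metric(results, "rewrite_acc"))
print("generalization:", average_metric(results, "rephrase_acc"))
```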
We present editing results under synthetic evaluation (syn.) and WILD evaluation (WILD) across various editing methods, LLMs, and datasets.
We will continue to update and share more evaluation results for additional LLMs and editing methods!
If you have any further questions, please feel free to contact us. If you find our work helpful, please cite our paper:
@inproceedings{yang-etal-2025-mirage,
title = "The Mirage of Model Editing: Revisiting Evaluation in the Wild",
author = "Yang, Wanli and
Sun, Fei and
Tan, Jiajun and
Ma, Xinyu and
Cao, Qi and
Yin, Dawei and
Shen, Huawei and
Cheng, Xueqi",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.745/",
doi = "10.18653/v1/2025.acl-long.745",
pages = "15336--15354",
ISBN = "979-8-89176-251-0"
}
Our code is based on EasyEdit and lm-evaluation-harness.
