This repository contains the official implementation of Adversarial Paraphrasing, introduced in the paper:
📄 Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text
Adversarial Paraphrasing is a universal, training-free attack that humanizes AI-generated text to evade a wide variety of AI text detectors.
Adversarial paraphrasing is a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection.
Key Insight: Most, if not all, high-performing detectors tend to converge toward a common distribution that characterizes human-authored text. Consequently, if a paraphraser is guided to evade detection by a well-trained detector, its outputs will naturally align more closely with this shared, human-like distribution. As a result, the generated text becomes more difficult to detect—not only by the detector used for guidance, but also by others, since all are ideally calibrated to the same underlying distribution of human-written text.
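The guided decoding is implemented in `utils.py`; the following is only a minimal, self-contained sketch of the idea. The candidate count, helper names, and the `detector_ai_score` interface are illustrative assumptions, not the repository's actual API.

```python
import torch

def guided_decode_step(paraphraser, detector_ai_score, prefix_ids, k=10):
    """Illustrative sketch of one detector-guided decoding step.

    Among the paraphraser's top-k next-token candidates, keep the one whose
    continuation the guidance detector scores as least AI-like.
    `detector_ai_score` is assumed to map a token-id tensor to a scalar,
    where higher means "more AI-generated".
    """
    with torch.no_grad():
        logits = paraphraser(prefix_ids).logits[0, -1]   # next-token logits
    candidates = torch.topk(logits, k).indices           # top-k candidate tokens
    scored = []
    for tok in candidates:
        extended = torch.cat([prefix_ids, tok.view(1, 1)], dim=1)
        scored.append((float(detector_ai_score(extended)), extended))
    _, best = min(scored, key=lambda pair: pair[0])      # most human-like continuation
    return best
```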
Our experiments show that the attack is both broadly effective and highly transferable across several detection systems.
For questions or feedback, please email Yize Cheng and Vinu Sankar Sadasivan.
- `utils.py`: Core implementation of the Adversarial Paraphrasing algorithm.
- `quality_judge_utils.py`: Prompt templates for GPT-4o quality evaluation.
- `outputs/`: Saved outputs of simple and adversarially paraphrased texts, as well as detection scores on human texts used for plotting the ROC curve.
- `scripts/`: Scripts for running paraphrasing, detection, and dataset creation.
- `kgw_wm/wm_mage/`, `uni_wm/wm_mage/`: Watermarked datasets (KGW and Unigram).
- Implementations of the detectors used in the paper.
- `parseNsave_paraphrased_output.py`: Parses and saves paraphrased outputs into HuggingFace dataset format.
We recommend creating a separate virtual or conda environment with `python>=3.10`, and then running:
```bash
pip install -r requirements.txt
```
We covered two types of watermarks in our experiments: KGW and Unigram.
The watermarked datasets are available under kgw_wm/wm_mage/ and uni_wm/wm_mage/ respectively (in HuggingFace format).
To recreate them:
```bash
# SLURM
sbatch scripts/create_wm_mage.sbatch
# or local CUDA
bash scripts/create_wm_mage.sbatch
```
Customize the `wm_name` variable in the script to either `kgw_wm` or `uni_wm`.
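Since the splits are stored in HuggingFace format, they can be inspected directly. A minimal example, assuming they were written with `Dataset.save_to_disk`:

```python
from datasets import load_from_disk

# Load the KGW-watermarked MAGE split; use "uni_wm/wm_mage" for the Unigram variant.
wm_dataset = load_from_disk("kgw_wm/wm_mage")
print(wm_dataset)
```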
To run adversarial paraphrasing on original AI texts:
```bash
# SLURM
sbatch scripts/transfer_test.sbatch
# or local CUDA
bash scripts/transfer_test.sbatch > path_to_log_file.log
```
Parameters to customize:
- `dataset`: Input dataset (must match the detector type, e.g. watermarked).
- `guidance_classifier`: Detector used to guide the paraphrasing.
- `deploy_classifier`: Detector used to evaluate the paraphrased outputs.
- `adversarial`: Set to `1` to enable adversarial paraphrasing, or `0` for regular paraphrasing (the guidance classifier is ignored in this case).
When run with SLURM, outputs are saved in SLURM logs. In local environments, redirect output to a file to capture results.
The results include both the input and output texts, along with their detection scores. Lower scores indicate more 'human-like' content, while higher scores suggest more 'AI-generated' content.
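As a reference point, the human-text scores saved under `outputs/` can be combined with the scores from such a run to plot an ROC curve. Below is a small self-contained sketch with dummy numbers (the exact file layout in `outputs/` is not assumed here; load the scores however you prefer):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Dummy scores for illustration only; substitute real detector outputs.
# Convention from above: higher score = more AI-like.
human_scores = np.array([0.05, 0.20, 0.10, 0.30])   # human-written texts
ai_scores = np.array([0.40, 0.70, 0.55, 0.90])      # (paraphrased) AI-generated texts

labels = np.concatenate([np.zeros_like(human_scores), np.ones_like(ai_scores)])
scores = np.concatenate([human_scores, ai_scores])
fpr, tpr, _ = roc_curve(labels, scores)
print(f"AUROC = {auc(fpr, tpr):.3f}")
```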
Once you've generated adversarially paraphrased outputs, you can evaluate them using different detectors without re-paraphrasing:
- Extract the paraphrased outputs. Edit and run:
  ```bash
  python parseNsave_paraphrased_output.py
  ```
- Run additional detectors on these outputs:
  ```bash
  # SLURM
  sbatch scripts/detect_existing_paraphrased_text.sbatch
  # or local CUDA
  bash scripts/detect_existing_paraphrased_text.sbatch
  ```
We provide utilities to evaluate the quality of the paraphrased texts using GPT-4o.
- Set your `OPENAI_API_KEY` as an environment variable.
- Load saved texts in any way you prefer and pass them to the functions in `quality_judge_utils.py` (see the sketch below).
- The included prompt templates follow our experimental design for consistent quality evaluation.
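A minimal usage sketch follows; the helper name `build_quality_prompt` is hypothetical, so substitute whatever prompt-template functions `quality_judge_utils.py` actually exposes.

```python
import os
from openai import OpenAI
from quality_judge_utils import build_quality_prompt  # hypothetical helper name

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

original_text = "..."      # an original AI-generated passage
paraphrased_text = "..."   # its adversarially paraphrased version

# Fill the quality-judging prompt template and send it to GPT-4o.
prompt = build_quality_prompt(original_text, paraphrased_text)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```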
If you find our work useful, please consider citing us via:
```bibtex
@misc{cheng2025adversarialparaphrasinguniversalattack,
      title={Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text},
      author={Yize Cheng and Vinu Sankar Sadasivan and Mehrdad Saberi and Shoumik Saha and Soheil Feizi},
      year={2025},
      eprint={2506.07001},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.07001},
}
```
