This repository contains the source code to replicate the experimental results in our paper.
We use Anaconda 24.3.0 to set up our Python virtual environment.

    conda create -n private-synthetic-text-generation python=3.8
    conda activate private-synthetic-text-generation

We install the remaining requirements with pip.

    pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
    pip install -r requirements.txt

Please download the respective datasets and put the csv files in the destination folders (SWMH access needs to be granted by its creators).

| Dataset | Source | Manually move to |
|---|---|---|
| Drugs.com | Already in Repository | not needed |
| SPAM | 🔗 | data/spam/ 📂 |
| SWMH | 🔗 | data/swmh/ 📂 |
| Thumbs-Up | Already available on huggingface datasets | not needed |
| WebMD | 🔗 | data/webmd/ 📂 |
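
Before preprocessing, it can help to confirm that the manually downloaded files ended up in the right place. The following is a minimal sketch, not part of the repository: the helper name `find_missing_datasets` is hypothetical, and it only assumes that each destination folder from the table above should contain at least one `.csv` file. The actual file names are not specified here, so they are not checked.

```python
from pathlib import Path

# Destination folders from the dataset table above.
# Drugs.com and Thumbs-Up need no manual download, so they are not listed.
EXPECTED_DATA_DIRS = ("data/spam", "data/swmh", "data/webmd")


def find_missing_datasets(repo_root="."):
    """Return the expected data folders that contain no .csv file.

    Hypothetical helper for illustration; it does not validate file contents.
    """
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DATA_DIRS:
        folder = root / rel
        # A missing folder and a folder without csv files are treated alike.
        if not folder.is_dir() or not any(folder.glob("*.csv")):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = find_missing_datasets()
    if missing:
        print("No csv files found in:", ", ".join(missing))
    else:
        print("All manually downloaded datasets appear to be in place.")
```

Run from the repository root, the script lists every folder that still needs its csv files.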

Then you can run the three preprocessing scripts:

    python preprocessing.py
    python create_samples.py
    python create_val_sets.py

Our code relies on some publicly available text diffusion model checkpoints, which you can download here:

| Model | Source | Manually move to |
|---|---|---|
| GENIE | 🔗 | GENIE/ 📂 |
| DiffuSeq | 🔗 | DiffuSeq/ 📂 |
| SeqDiffuSeq | t.b.d. | SeqDiffuSeq/ 📂 |
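
A similar sanity check can be sketched for the checkpoints. This example is an assumption-laden illustration rather than repository code: the helper `count_checkpoint_files` is hypothetical, and the list of file extensions is a guess at common checkpoint formats, since the actual checkpoint file names are not stated here.

```python
from pathlib import Path

# Destination folders from the model table above.
CHECKPOINT_DIRS = ("GENIE", "DiffuSeq", "SeqDiffuSeq")

# Assumption: checkpoints use one of these common extensions.
# Adjust this set if the downloaded files are named differently.
CHECKPOINT_SUFFIXES = {".pt", ".pth", ".bin", ".ckpt", ".safetensors"}


def count_checkpoint_files(repo_root="."):
    """Map each model folder to the number of checkpoint-like files in it.

    Hypothetical helper; folders are searched recursively.
    """
    root = Path(repo_root)
    counts = {}
    for name in CHECKPOINT_DIRS:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(
                1
                for path in folder.rglob("*")
                if path.is_file() and path.suffix.lower() in CHECKPOINT_SUFFIXES
            )
        else:
            counts[name] = 0
    return counts


if __name__ == "__main__":
    for model, n_files in count_checkpoint_files().items():
        print(f"{model}: {n_files} checkpoint-like file(s) found")
```

A count of zero for a model suggests that its checkpoint has not been moved into place yet, or that it uses an extension not covered by the assumed set.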

Please use the following citation:

    @misc{ochs2024privatesynthetictextgeneration,
        title={Private Synthetic Text Generation with Diffusion Models},
        author={Sebastian Ochs and Ivan Habernal},
        year={2024},
        eprint={2410.22971},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2410.22971},
    }

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.