Official codebase for *Enable Fast Sampling for Seq2Seq Text Diffusion* (Findings of EMNLP 2024).
Performance: the figure on the left shows BLEU scores of different models on the paraphrase task (QQP dataset). FMSeq beats all baselines when using a single sampling step and matches DiffuSeq (2000 steps) with only 10 steps.

Workflow: the figure on the right shows the workflow of FMSeq. An embedding layer maps the discrete token space into a continuous space; the forward process diffuses the target embedding along a linear path, and the model fits the velocity of the target part, conditioned on the clean source embedding and the noisy target embedding.
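The linear forward path and its constant velocity target can be sketched in a few lines of NumPy. This is a toy illustration only, not the repo's actual API: the names `x0`/`x1` and the convention that `t=0` is pure noise and `t=1` the clean target embedding are assumptions and may differ from the code.

```python
import numpy as np

def forward_interpolate(x0, x1, t):
    """Point on the straight line from noise x0 to target embedding x1 at time t,
    plus the constant velocity (x1 - x0) that the model is trained to regress.
    Hypothetical sketch; the repo's time convention may be reversed."""
    xt = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return xt, velocity

# toy example: a 4-token target sequence with 8-dim embeddings
rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))   # "clean" target embeddings
x0 = rng.normal(size=(4, 8))   # Gaussian noise
xt, v = forward_interpolate(x0, x1, t=0.5)
```

Because the path is linear, the velocity is the same at every `t`, which is what makes very few sampling steps viable.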
Prepare the datasets and put them under the `datasets` folder; take `datasets/CommonsenseConversation/train.jsonl` as an example. We use the following datasets in our paper.
| Task | Datasets | Source |
|---|---|---|
| Open-domain Dialogue | CommonsenseConversation | download |
| Question Generation | Quasar-T | download |
| Text Simplification | Wiki-alignment | download |
| Paraphrase | QQP-Official | download |
| Machine Translation | iwslt14-de-en | download |
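Each dataset file is expected in JSON Lines format. The sketch below shows the DiffuSeq-style layout of one line with `"src"` and `"trg"` fields; the exact keys are an assumption, so check the actual files under `datasets/`.

```python
import json

# Hypothetical example line from a paraphrase train.jsonl (DiffuSeq-style keys)
line = '{"src": "how do i learn python quickly?", "trg": "what is the fastest way to learn python?"}'
example = json.loads(line)
src, trg = example["src"], example["trg"]
```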
For non-MT (machine translation) tasks, run:

```shell
cd scripts
# qqp:
bash train_qqp.sh
# others: modify learning_steps, dataset, data_dir, notes
```

For MT tasks, run:

```shell
cd scripts
bash train_de2en.sh
```

The trained checkpoints are provided here: link of ckpt
```shell
cd scripts
bash run_decode.sh
# core parameters: step and td
```
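The `step` parameter controls how many sampling steps the decoder takes. Conceptually, few-step sampling integrates the learned velocity field with a coarse fixed-step solver; below is a minimal Euler-style sketch, where `velocity_fn` is a hypothetical stand-in for the trained model (not the repo's actual decoding code).

```python
import numpy as np

def euler_sample(velocity_fn, x, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 in `steps` Euler steps.
    Hypothetical sketch of few-step sampling; the real solver may differ."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # move along the predicted velocity
    return x

# toy check: with the true constant velocity x1 - x0 of a linear path,
# even a single Euler step recovers the target exactly
x0 = np.zeros((4, 8))
x1 = np.ones((4, 8))
x_hat = euler_sample(lambda x, t: x1 - x0, np.copy(x0), steps=1)
```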
```shell
bash eval.sh
# you can evaluate a single file or multiple files in the same folder (MBR by default)
```

Please cite our paper if it or the code helps you:
@inproceedings{liu2024enable,
title={Enable Fast Sampling for Seq2Seq Text Diffusion},
author={Liu, Pan and Tian, Xiaohua and Lin, Zhouhan},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
pages={8495--8505},
year={2024}
}

This implementation is based on DiffuSeq.

