This repository contains the official implementation of our paper:
Information Theoretic Discrete Diffusion.
We provide code for four experiments introduced in the paper.
All experiments can be run on a single NVIDIA L40S GPU (48GB VRAM).
Create and activate the Conda environment:
```
conda env create -f environment.yaml
conda activate infodis
```

Refer to Section 4.2 (Detecting Out-of-Distribution Inputs) and Figure 3 in the paper.
- The model (RADD) is trained on `text8`.
- Conditional NLL is evaluated on both `text8` and GPT-generated text.
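To see intuitively why conditional NLL can flag out-of-distribution inputs, here is a minimal, self-contained sketch using a toy character-level unigram model (not the RADD model from the paper): text drawn from the training distribution scores a lower average NLL than unfamiliar text.

```python
import math
from collections import Counter

def unigram_nll(train_text, eval_text):
    """Average negative log-likelihood (nats/char) of eval_text under a
    character unigram model fit on train_text, with add-one smoothing."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(eval_text)
    total = len(train_text) + len(vocab)  # add-one smoothing mass
    return -sum(math.log((counts[c] + 1) / total) for c in eval_text) / len(eval_text)

in_dist = "the quick brown fox jumps over the lazy dog " * 20
ood = "zzzzqqqqxxxx" * 10  # stands in for text from a different source

nll_in = unigram_nll(in_dist, in_dist)
nll_ood = unigram_nll(in_dist, ood)
# nll_in < nll_ood: the OOD text is "surprising" under the fitted model.
```

The same separation, measured with the diffusion model's conditional NLL instead of a unigram model, is what the text8-vs-GPT experiment reports.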
```
CUDA_VISIBLE_DEVICES=0 python pretrain_text8.py
```

This will create checkpoints like:
```
checkpoints/text8/checkpoint_7501.pth
```
which indicates the model was trained for 7500 steps.
Edit `ckpt_dir` in `eval_text8.py` (set to `None` by default) so that it points to the desired checkpoint, for instance:

```
ckpt_dir = "checkpoints/text8/checkpoint_7501.pth"
```

Then run:
```
CUDA_VISIBLE_DEVICES=0 python eval_text8.py
```

This will automatically produce figures in the `./figures` directory.
The following experiments follow a similar procedure.
Refer to Section 4.1 (Evaluating reliability of I-MDCE on toy dataset) and Figures 1 and 2.
These experiments evaluate NLL on synthetic DNA sequence datasets.
- 128-sequence experiment:

  ```
  CUDA_VISIBLE_DEVICES=0 python pretrain_sequence.py
  CUDA_VISIBLE_DEVICES=0 python eval_sequence.py
  ```

- 4th-order Markov experiment:

  ```
  CUDA_VISIBLE_DEVICES=0 python pretrain_markov.py
  CUDA_VISIBLE_DEVICES=0 python eval_markov.py
  ```

Note: manually update `ckpt_dir` in the evaluation scripts before running.
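For readers unfamiliar with the synthetic setup, a 4th-order Markov DNA source conditions each new symbol on the previous four. The sketch below is a hypothetical generator for illustration only (the function name and random transition probabilities are assumptions, not the paper's dataset code):

```python
import random

def sample_markov_dna(length, order=4, seed=0):
    """Sample a DNA string from a randomly parameterized order-`order`
    Markov chain over {A, C, G, T}. Illustrative sketch only."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    transitions = {}  # context (tuple of `order` symbols) -> next-symbol probs
    seq = [rng.choice(alphabet) for _ in range(order)]  # random initial context
    while len(seq) < length:
        ctx = tuple(seq[-order:])
        if ctx not in transitions:
            # Draw a random categorical distribution for an unseen context.
            weights = [rng.random() for _ in alphabet]
            total = sum(weights)
            transitions[ctx] = [w / total for w in weights]
        seq.append(rng.choices(alphabet, weights=transitions[ctx])[0])
    return "".join(seq)

dna = sample_markov_dna(128)
```

Because each context's distribution is fixed once drawn, the true per-symbol NLL of such a source is known in closed form, which is what makes it a useful reliability check for an NLL estimator.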
Refer to Section 4.2 (Application to a Large-Scale Open-Source Model) and Figures 4 and 8.
This experiment evaluates NLL using LLaDA, a recent open-source model, on the following datasets:
- `wikitext`
- `pretrain_zh`
- LLaMA-generated responses

```
CUDA_VISIBLE_DEVICES=0 python eval_llada.py --dataset wikitext
```

You can reduce the dataset size manually, as the default evaluation file contains a large amount of data. However, if you reduce the number of Monte Carlo estimation samples dramatically, the results will not be accurate.
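The trade-off behind that warning is generic Monte Carlo behavior: the spread of the estimate shrinks roughly like 1/sqrt(n_samples). A minimal sketch (synthetic noise, not the actual NLL estimator) illustrates this:

```python
import random
import statistics

def estimate_spread(true_mean, n_samples, n_trials=200, seed=0):
    """Repeat a Monte Carlo mean estimate n_trials times and return the
    standard deviation of the estimates (their spread across trials)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        draws = [true_mean + rng.gauss(0, 1) for _ in range(n_samples)]
        estimates.append(sum(draws) / n_samples)
    return statistics.stdev(estimates)

spread_small = estimate_spread(3.0, n_samples=4)    # few MC samples: noisy
spread_large = estimate_spread(3.0, n_samples=256)  # many MC samples: tight
```

So shrinking the evaluation file mainly costs dataset coverage, while shrinking the per-example Monte Carlo sample count directly inflates the variance of each NLL estimate.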