This repository contains the official implementation of our paper:
Information Theoretic Discrete Diffusion.
We provide code for four experiments introduced in the paper.
All experiments can be run on a single NVIDIA L40S GPU (48GB VRAM).
Create and activate the Conda environment:
```
conda env create -f environment.yaml
conda activate infodis
```

Refer to Section 4.2 (Detecting Out-of-Distribution Inputs) and Figure 3 in the paper.
- The model (RADD) is trained on `text8`.
- Conditional NLL is evaluated on both `text8` and GPT-generated text.
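To see intuitively why conditional NLL can flag out-of-distribution inputs, here is a minimal, self-contained sketch using a toy character-level unigram model (not the RADD model from the paper): text drawn from the training distribution scores a lower average NLL than unfamiliar text.

```python
import math
from collections import Counter

def unigram_nll(train_text, eval_text):
    """Average negative log-likelihood (nats/char) of eval_text under a
    character unigram model fit on train_text, with add-one smoothing."""
    counts = Counter(train_text)
    vocab = set(train_text) | set(eval_text)
    total = len(train_text) + len(vocab)  # add-one smoothing mass
    return -sum(math.log((counts[c] + 1) / total) for c in eval_text) / len(eval_text)

in_dist = "the quick brown fox jumps over the lazy dog " * 20
ood = "zzzzqqqqxxxx" * 10  # stands in for text from a different source

nll_in = unigram_nll(in_dist, in_dist)
nll_ood = unigram_nll(in_dist, ood)
# nll_in < nll_ood: the OOD text is "surprising" under the fitted model.
```

The same separation, measured with the diffusion model's conditional NLL instead of a unigram model, is what the text8-vs-GPT experiment reports.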
```
CUDA_VISIBLE_DEVICES=0 python pretrain_text8.py
```

This will create checkpoints like:
```
checkpoints/text8/checkpoint_7501.pth
```
which indicates the model was trained for 7500 steps.
Edit `ckpt_dir` in `eval_text8.py` (set to `None` by default) so that it points to the desired checkpoint, for instance:

```
ckpt_dir = "checkpoints/text8/checkpoint_7501.pth"
```

Then run:
```
CUDA_VISIBLE_DEVICES=0 python eval_text8.py
```

This will automatically produce figures in the `./figures` directory.
The following experiments follow a similar procedure.
Refer to Section 4.1 (Evaluating reliability of I-MDCE on toy dataset) and Figures 1 and 2.
These experiments evaluate NLL on synthetic DNA sequence datasets.
- 128-sequence experiment:

  ```
  CUDA_VISIBLE_DEVICES=0 python pretrain_sequence.py
  CUDA_VISIBLE_DEVICES=0 python eval_sequence.py
  ```

- 4th-order Markov experiment:

  ```
  CUDA_VISIBLE_DEVICES=0 python pretrain_markov.py
  CUDA_VISIBLE_DEVICES=0 python eval_markov.py
  ```

Note: manually update `ckpt_dir` in the evaluation scripts before running.
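For readers unfamiliar with the synthetic setup, a 4th-order Markov DNA source conditions each new symbol on the previous four. The sketch below is a hypothetical generator for illustration only (the function name and random transition probabilities are assumptions, not the paper's dataset code):

```python
import random

def sample_markov_dna(length, order=4, seed=0):
    """Sample a DNA string from a randomly parameterized order-`order`
    Markov chain over {A, C, G, T}. Illustrative sketch only."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    transitions = {}  # context (tuple of `order` symbols) -> next-symbol probs
    seq = [rng.choice(alphabet) for _ in range(order)]  # random initial context
    while len(seq) < length:
        ctx = tuple(seq[-order:])
        if ctx not in transitions:
            # Draw a random categorical distribution for an unseen context.
            weights = [rng.random() for _ in alphabet]
            total = sum(weights)
            transitions[ctx] = [w / total for w in weights]
        seq.append(rng.choices(alphabet, weights=transitions[ctx])[0])
    return "".join(seq)

dna = sample_markov_dna(128)
```

Because each context's distribution is fixed once drawn, the true per-symbol NLL of such a source is known in closed form, which is what makes it a useful reliability check for an NLL estimator.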
Refer to Section 4.2 (Application to a Large-Scale Open-Source Model) and Figures 4 and 8.
This experiment evaluates NLL using LLaDA, a recent open-source model, on the following datasets:
- `wikitext`
- `pretrain_zh`
- LLaMA-generated responses

```
CUDA_VISIBLE_DEVICES=0 python eval_llada.py --dataset wikitext
```

You can reduce the dataset size manually, as the default evaluation file contains a large amount of data. However, if you reduce the number of Monte Carlo estimation samples dramatically, the results will not be accurate.
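The trade-off behind that warning is generic Monte Carlo behavior: the spread of the estimate shrinks roughly like 1/sqrt(n_samples). A minimal sketch (synthetic noise, not the actual NLL estimator) illustrates this:

```python
import random
import statistics

def estimate_spread(true_mean, n_samples, n_trials=200, seed=0):
    """Repeat a Monte Carlo mean estimate n_trials times and return the
    standard deviation of the estimates (their spread across trials)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        draws = [true_mean + rng.gauss(0, 1) for _ in range(n_samples)]
        estimates.append(sum(draws) / n_samples)
    return statistics.stdev(estimates)

spread_small = estimate_spread(3.0, n_samples=4)    # few MC samples: noisy
spread_large = estimate_spread(3.0, n_samples=256)  # many MC samples: tight
```

So shrinking the evaluation file mainly costs dataset coverage, while shrinking the per-example Monte Carlo sample count directly inflates the variance of each NLL estimate.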