This directory contains our pytorch implementation of Transformer-XL. Note that our state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster, and our pytorch codebase currently does not support distributed training. Here we provide two sets of hyperparameters and scripts:
- `*large.sh` are for the SoTA setting with large models, which might not be directly runnable on a local GPU machine.
- `*base.sh` are for the base models, which can be run on a few GPUs.
The pytorch implementation produces similar results to the TF codebase under the same settings in our preliminary experiments.
Prerequisite:
- Pytorch 0.4: `conda install pytorch torchvision -c pytorch`

Data preparation:
- `bash getdata.sh`
Training and evaluation on `enwik8`:
- Make sure the machine has 4 GPUs, each with at least 11G of memory.
- Training: `bash run_enwik8_base.sh train --work_dir PATH_TO_WORK_DIR`
- Evaluation: `bash run_enwik8_base.sh eval --work_dir PATH_TO_WORK_DIR`
Training and evaluation on `wikitext-103`:
- Make sure the machine has 4 GPUs, each with at least 11G of memory.
- Training: `bash run_wt103_base.sh train --work_dir PATH_TO_WORK_DIR`
- Evaluation: `bash run_wt103_base.sh eval --work_dir PATH_TO_WORK_DIR`
Other options:
- `--batch_chunk`: this option allows one to trade speed for memory. For `batch_chunk > 1`, the program splits each training batch into `batch_chunk` sub-batches and performs the forward and backward passes on each sub-batch sequentially, with the gradients accumulated and then divided by `batch_chunk`. Hence, memory usage decreases proportionally while computation time increases correspondingly.
- `--div_val`: when using the adaptive softmax and adaptive embedding, the embedding dimension is divided by `div_val` from bin $i$ to bin $i+1$. This saves both GPU memory and the parameter budget.
- `--fp16` and `--dynamic-loss-scale`: run in pseudo-fp16 mode (fp16 storage, fp32 math) with dynamic loss scaling. Note: to use the `--fp16` option, please make sure the `apex` package is installed (https://github.com/NVIDIA/apex/).
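The `--batch_chunk` behavior described above is standard gradient accumulation. A minimal numpy sketch (using a toy linear model with an MSE loss, not the actual training loop) shows why splitting a batch into equal-sized sub-batches and averaging the accumulated gradients reproduces the full-batch gradient:

```python
import numpy as np

def grad_mse_linear(w, X, y):
    """Gradient of the mean squared error for a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def accumulated_grad(w, X, y, batch_chunk):
    """Split the batch into `batch_chunk` sub-batches, run backward on
    each sequentially, accumulate, and divide by `batch_chunk` --
    mirroring what the --batch_chunk option does during training."""
    grad = np.zeros_like(w)
    for Xc, yc in zip(np.array_split(X, batch_chunk),
                      np.array_split(y, batch_chunk)):
        grad += grad_mse_linear(w, Xc, yc)   # backward on one sub-batch
    return grad / batch_chunk                # average over sub-batches

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

full = grad_mse_linear(w, X, y)
chunked = accumulated_grad(w, X, y, batch_chunk=4)
print(np.allclose(full, chunked))  # True: same update, lower peak memory
```

Only one sub-batch's activations are alive at a time, which is where the memory saving comes from; the extra sequential passes are the speed cost.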
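The `--div_val` schedule can be illustrated with a small helper. This is a sketch of the dimension arithmetic only; `embedding_dims` and the cutoff values below are made-up examples, not the repo's API or defaults:

```python
def embedding_dims(d_embed, cutoffs, div_val):
    """Embedding dimension per frequency bin: bin i uses
    d_embed // (div_val ** i), so rarer tokens get smaller embeddings."""
    return [max(1, d_embed // (div_val ** i)) for i in range(len(cutoffs))]

# Hypothetical setup: 4 frequency bins, base dimension 512, div_val=2.
print(embedding_dims(512, cutoffs=[2000, 10000, 50000, 200000], div_val=2))
# [512, 256, 128, 64]
```

With `div_val=1` every bin keeps the full dimension, so both the memory and parameter savings come from `div_val > 1`.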
- To see performance without the recurrence mechanism, simply set `mem_len=0` in all your scripts.
- To see performance of a standard Transformer without relative positional encodings or recurrence mechanisms, use `attn_type=2` and `mem_len=0`.
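The role of `mem_len` can be sketched in a few lines of numpy: each segment attends over the concatenation of cached hidden states and the current segment, and the cache keeps the last `mem_len` positions. This is a toy shape-level illustration only (no attention math; `forward_segment` is a hypothetical helper, not the repo's API):

```python
import numpy as np

def forward_segment(seg, mems, mem_len):
    """Toy view of Transformer-XL recurrence: the context for a segment
    is [cached memory; current segment]; the cache keeps the last
    `mem_len` positions. mem_len=0 disables the recurrence entirely."""
    context = np.concatenate([mems, seg], axis=0) if len(mems) else seg
    new_mems = context[-mem_len:] if mem_len > 0 else context[:0]
    return context, new_mems

# Two segments of length 4 with hidden size 3 (shapes are illustrative).
seg1, seg2 = np.ones((4, 3)), np.ones((4, 3))
mems = np.empty((0, 3))
ctx1, mems = forward_segment(seg1, mems, mem_len=2)
ctx2, mems = forward_segment(seg2, mems, mem_len=2)
print(ctx1.shape, ctx2.shape)  # (4, 3) (6, 3): segment 2 also sees 2 cached rows
```

With `mem_len=0` both contexts would have shape `(4, 3)`, i.e. each segment is processed in isolation, which is exactly the ablation described above.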
Other datasets:
- `Text8` character-level language modeling: check out `run_text8_base.sh`
- `lm1b` word-level language modeling: check out `run_lm1b_base.sh`