
Pad each batch, not the whole dataset

sshleifer opened this pull request • 0 comments

Previously, each sequence was padded to the length of the longest sequence in the dataset. In this PR, each batch is padded to the length of the longest sequence in the batch. This results in a 30% speedup with negligible impact on metrics.
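To see where the speedup comes from, here is a toy illustration (with made-up sequence lengths) of how many pad tokens each scheme adds; dataset-level padding forces every short batch up to the global maximum:

```python
# Toy illustration with hypothetical lengths: three batches of four sequences.
lengths = [[12, 15, 14, 13], [40, 38, 41, 39], [20, 22, 21, 19]]

# Old scheme: pad every sequence to the dataset-wide maximum length.
dataset_max = max(l for batch in lengths for l in batch)  # 41
pad_whole = sum(dataset_max - l for batch in lengths for l in batch)

# New scheme: pad each sequence only to its own batch's maximum length.
pad_per_batch = sum(max(batch) - l for batch in lengths for l in batch)

print(pad_whole, pad_per_batch)  # 198 vs 18 wasted pad tokens
```

Fewer pad tokens means fewer wasted FLOPs per forward pass, which is where the roughly 30% wall-clock saving comes from.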

Code Changes

  • ChatDataset yields example dicts like {'input_ids': [[hist + cand_1], ..., [hist + cand_n]]} for each of the PADDED_INPUTS, with mc_token_ids and mc_labels in the same format as before.
  • ChatDataset().collate_fn(examples: list) turns a list of example dicts into the list of 5 tensors by batching them and padding them in one step.
  • As a result, get_dataloaders does much less work.
  • To facilitate this, the data format changes in the step where we build the lists of examples.
  • convai_evaluation.py still calls the old pad_dataset.
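The collate step described above might look roughly like this (a sketch, not the exact PR code; PAD_ID and the field names are assumptions, and the real code converts the results to torch tensors):

```python
PAD_ID = 0  # assumed pad token id; the real code uses the tokenizer's pad token


def collate_fn(examples):
    """Pad 'input_ids' to the longest candidate in THIS batch, not the dataset."""
    # examples: list of dicts, each with 'input_ids' as [n_candidates][seq_len]
    max_len = max(len(cand) for ex in examples for cand in ex['input_ids'])
    return {
        'input_ids': [[cand + [PAD_ID] * (max_len - len(cand))
                       for cand in ex['input_ids']] for ex in examples],
        # index of the last real token in each candidate (for mc_token_ids)
        'mc_token_ids': [[len(cand) - 1 for cand in ex['input_ids']]
                         for ex in examples],
        'mc_labels': [ex['mc_labels'] for ex in examples],
    }
```

In the training loop these nested lists would then be wrapped in torch.LongTensor before being fed to the double-head model.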

1 Epoch Sanity Check

Before change: 85 minutes. Validation: {'accuracy': 0.7483655941545956, 'average_accuracy': 0.7483655941545956, 'average_nll': 2.6815188920676687, 'average_ppl': 14.607263311061963, 'nll': 2.6815188920676687}

After change: 60 minutes. Validation: {'accuracy': 0.7466991411357519, 'average_accuracy': 0.7466991411357519, 'average_nll': 2.6821035040007972, 'average_ppl': 14.615805388160778, 'nll': 2.6821035040007972}
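As a quick consistency check on the metrics above, the reported perplexities are just exp of the reported NLLs, so the ~0.0006 nat increase in NLL accounts for the small PPL difference:

```python
import math

before_nll = 2.6815188920676687  # average_nll before the change
after_nll = 2.6821035040007972   # average_nll after the change

print(math.exp(before_nll))  # matches 'average_ppl' before: 14.607263311061963
print(math.exp(after_nll))   # matches 'average_ppl' after:  14.615805388160778
```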

Command:

python train.py --model_checkpoint openai-gpt --dataset_cache dataset_cache --fp16 O1 --n_epochs 1 --train_batch_size 4

sshleifer, Sep 23 '19 20:09