transfer-learning-conv-ai
Pad each batch, not the whole dataset
Previously, each sequence was padded to the length of the longest sequence in the dataset. In this PR, each batch is padded to the length of the longest sequence in the batch. This results in a 30% speedup with negligible impact on metrics.
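The idea can be sketched as follows (a minimal illustration, not the PR's actual code; the helper name `pad_batch` is hypothetical):

```python
def pad_batch(sequences, pad_value=0):
    """Pad a batch of variable-length sequences to the batch's max length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 6, 7], [8, 9], [10]]
# Each sequence is padded only to the longest sequence in this batch (3 tokens),
# not to the longest sequence in the whole dataset.
padded = pad_batch(batch)  # [[5, 6, 7], [8, 9, 0], [10, 0, 0]]
```

Since most batches are much shorter than the dataset-wide maximum, the model processes far fewer padding tokens, which is where the speedup comes from.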
Code Changes
- `ChatDataset` yields example dicts like `{'input_ids': [[hist + cand1], .. [hist + cand_n]],}` for the `PADDED_INPUTS`, and `mc_token_ids` and `mc_labels` in the same format as previously.
- `ChatDataset().collate_fn(examples: list)` turns a list of example dicts into the list of 5 tensors by batching and padding them.
- As a result, `get_dataloaders` does much less.
- There is a data format change to the part of the process where we make lists of examples, to facilitate this.
- `convai_evaluation.py` still calls the old `pad_dataset`.
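The wiring described above can be sketched roughly like this (a simplified stand-in, not the PR's actual code: the real `ChatDataset` produces 5 tensors and uses the repo's `PADDED_INPUTS` fields, whereas this toy version pads only `input_ids` and batches `mc_labels`):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ChatDataset(Dataset):
    """Toy stand-in: yields example dicts and pads per batch in collate_fn."""
    def __init__(self, examples, pad_id=0):
        self.examples = examples
        self.pad_id = pad_id

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

    def collate_fn(self, examples):
        # Pad candidates to the longest sequence in *this batch* only.
        max_len = max(len(ids) for ex in examples for ids in ex['input_ids'])
        input_ids = torch.tensor([
            [ids + [self.pad_id] * (max_len - len(ids)) for ids in ex['input_ids']]
            for ex in examples
        ])
        mc_labels = torch.tensor([ex['mc_labels'] for ex in examples])
        return input_ids, mc_labels

examples = [
    {'input_ids': [[1, 2, 3], [1, 2]], 'mc_labels': 0},
    {'input_ids': [[4], [4, 5, 6, 7]], 'mc_labels': 1},
]
ds = ChatDataset(examples)
# get_dataloaders then only needs to hand collate_fn to the DataLoader.
loader = DataLoader(ds, batch_size=2, collate_fn=ds.collate_fn)
input_ids, mc_labels = next(iter(loader))
# input_ids has shape (batch, n_candidates, batch_max_len)
```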
1 Epoch Sanity Check
Before Change: 85 minutes
Validation: `{'accuracy': 0.7483655941545956, 'average_accuracy': 0.7483655941545956, 'average_nll': 2.6815188920676687, 'average_ppl': 14.607263311061963, 'nll': 2.6815188920676687}`

After Change: 60 minutes
Validation: `{'accuracy': 0.7466991411357519, 'average_accuracy': 0.7466991411357519, 'average_nll': 2.6821035040007972, 'average_ppl': 14.615805388160778, 'nll': 2.6821035040007972}`
Command:

```
python train.py --model_checkpoint openai-gpt --dataset_cache dataset_cache --fp16 O1 --n_epochs 1 --train_batch_size 4
```