This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@leezu (Contributor) commented Feb 14, 2020

Description

cublasGemmBatchedEx is only supported on GPUs with compute capability equal to or greater than 5.0.

Fixes a bug in #16408

Changes

  • Fix transformer.cu interleaved matmul for CUDA arch < 5

Comments

CC @Caenorst

@leezu leezu requested a review from ptrendx February 14, 2020 20:30
@access2rohit (Contributor) left a comment


LGTM!

@leezu (Contributor, Author) commented Feb 15, 2020

Verified this patch by finetuning BERT on a P2 instance.

Verification was initially blocked / delayed by #17576 ...

% python finetune_classifier.py --task_name RTE --batch_size 32 --epochs 3 --gpu 0 --lr 2e-5
INFO:root:01:21:10 Namespace(accumulate=None, batch_size=32, bert_dataset='book_corpus_wiki_en_uncased', bert_model='bert_12_768_12', calib_mode='customize', deploy=False, dev_batch_size=8, dtype='float32', early_stop=None, epochs=3, epsilon=1e-06, gpu=0, log_interval=10, lr=2e-05, max_len=128, model_parameters=None, model_prefix=None, num_calib_batches=5, only_calibration=False, only_inference=False, optimizer='bertadam', output_dir='./output_dir', pretrained_bert_parameters=None, quantized_dtype='auto', round_to=None, seed=2, task_name='RTE', training_steps=None, warmup_ratio=0.1)
[01:21:12] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:root:01:21:26 processing dataset...
INFO:root:01:21:35 Now we are doing BERT classification training on gpu(0)!
INFO:root:01:21:35 training steps=233
INFO:root:01:21:45 [Epoch 1 Batch 10/82] loss=0.7479, lr=0.0000078, metrics:accuracy:0.5507
INFO:root:01:21:54 [Epoch 1 Batch 20/82] loss=0.7263, lr=0.0000165, metrics:accuracy:0.5235
INFO:root:01:22:02 [Epoch 1 Batch 30/82] loss=0.6821, lr=0.0000194, metrics:accuracy:0.5306
INFO:root:01:22:12 [Epoch 1 Batch 40/82] loss=0.6718, lr=0.0000185, metrics:accuracy:0.5370
INFO:root:01:22:21 [Epoch 1 Batch 50/82] loss=0.6743, lr=0.0000175, metrics:accuracy:0.5518
INFO:root:01:22:31 [Epoch 1 Batch 60/82] loss=0.6894, lr=0.0000166, metrics:accuracy:0.5551
INFO:root:01:22:39 [Epoch 1 Batch 70/82] loss=0.6872, lr=0.0000156, metrics:accuracy:0.5587
INFO:root:01:22:48 [Epoch 1 Batch 80/82] loss=0.6626, lr=0.0000147, metrics:accuracy:0.5693
INFO:root:01:22:50 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:22:51 [Batch 10/35] loss=0.6449, metrics:accuracy:0.6750
INFO:root:01:22:52 [Batch 20/35] loss=0.6266, metrics:accuracy:0.6813
INFO:root:01:22:54 [Batch 30/35] loss=0.6930, metrics:accuracy:0.6625
INFO:root:01:22:54 validation metrics:accuracy:0.6715
INFO:root:01:22:54 Time cost=4.00s, throughput=69.97 samples/s
INFO:root:01:22:55 params saved in: ./output_dir/model_bert_RTE_0.params
INFO:root:01:22:55 Time cost=79.30s
INFO:root:01:23:03 [Epoch 2 Batch 10/82] loss=0.5310, lr=0.0000135, metrics:accuracy:0.7719
INFO:root:01:23:12 [Epoch 2 Batch 20/82] loss=0.5022, lr=0.0000126, metrics:accuracy:0.7650
INFO:root:01:23:22 [Epoch 2 Batch 30/82] loss=0.4835, lr=0.0000116, metrics:accuracy:0.7733
INFO:root:01:23:31 [Epoch 2 Batch 40/82] loss=0.4762, lr=0.0000107, metrics:accuracy:0.7754
INFO:root:01:23:40 [Epoch 2 Batch 50/82] loss=0.4412, lr=0.0000097, metrics:accuracy:0.7728
INFO:root:01:23:48 [Epoch 2 Batch 60/82] loss=0.4915, lr=0.0000088, metrics:accuracy:0.7741
INFO:root:01:23:57 [Epoch 2 Batch 70/82] loss=0.4512, lr=0.0000078, metrics:accuracy:0.7767
INFO:root:01:24:05 [Epoch 2 Batch 80/82] loss=0.3897, lr=0.0000069, metrics:accuracy:0.7832
INFO:root:01:24:06 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:24:08 [Batch 10/35] loss=0.6482, metrics:accuracy:0.7125
INFO:root:01:24:09 [Batch 20/35] loss=0.6311, metrics:accuracy:0.7125
INFO:root:01:24:10 [Batch 30/35] loss=0.7034, metrics:accuracy:0.7042
INFO:root:01:24:10 validation metrics:accuracy:0.7076
INFO:root:01:24:10 Time cost=4.00s, throughput=70.06 samples/s
INFO:root:01:24:11 params saved in: ./output_dir/model_bert_RTE_1.params
INFO:root:01:24:11 Time cost=76.11s
INFO:root:01:24:21 [Epoch 3 Batch 10/82] loss=0.2911, lr=0.0000057, metrics:accuracy:0.9125
INFO:root:01:24:30 [Epoch 3 Batch 20/82] loss=0.2762, lr=0.0000048, metrics:accuracy:0.9092
INFO:root:01:24:39 [Epoch 3 Batch 30/82] loss=0.2438, lr=0.0000038, metrics:accuracy:0.9121
INFO:root:01:24:47 [Epoch 3 Batch 40/82] loss=0.2719, lr=0.0000029, metrics:accuracy:0.9077
INFO:root:01:24:56 [Epoch 3 Batch 50/82] loss=0.2787, lr=0.0000019, metrics:accuracy:0.9054
INFO:root:01:25:05 [Epoch 3 Batch 60/82] loss=0.3279, lr=0.0000010, metrics:accuracy:0.9049
INFO:root:01:25:12 Finish training step: 233
INFO:root:01:25:12 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:25:14 [Batch 10/35] loss=0.7463, metrics:accuracy:0.7125
INFO:root:01:25:15 [Batch 20/35] loss=0.6660, metrics:accuracy:0.7250
INFO:root:01:25:16 [Batch 30/35] loss=0.7802, metrics:accuracy:0.7125
INFO:root:01:25:16 validation metrics:accuracy:0.7112
INFO:root:01:25:16 Time cost=3.97s, throughput=70.60 samples/s
INFO:root:01:25:17 params saved in: ./output_dir/model_bert_RTE_2.params
INFO:root:01:25:17 Time cost=65.91s
INFO:root:01:25:17 Best model at epoch 2. Validation metrics:accuracy:0.7112
INFO:root:01:25:17 Now we are doing testing on test with gpu(0).
INFO:root:01:25:54 Time cost=36.38s, throughput=82.47 samples/s

@leezu leezu merged commit d352673 into apache:master Feb 15, 2020
@leezu leezu deleted the fixtransformercu branch February 15, 2020 06:00
leezu added a commit to leezu/mxnet that referenced this pull request Feb 15, 2020
@leezu leezu mentioned this pull request Feb 15, 2020
leezu added a commit that referenced this pull request Feb 17, 2020
* Fix transformer.cu interleaved matmul for cuda arch < 5  (#17596)

cublasGemmBatchedEx is only supported on GPUs with compute capability equal to or greater than 5.0.

Fixes a bug in #16408

* pin Markdown version to 3.1 in Julia doc build (#17549)

* pin Sphinx due to autodocsumm issue with v4.2.0 (#17561)

* pin python dependencies (#17556)

* [CI] Fix static build pipeline (#17474)

* 1.5.x CI fixes (#17426)

* Fix numpy decorator

* Workaround pytest-dev/pytest#5903

* Disable pylint warnings

* Fix Edge build

* Fix numpy decorator on Centos

* Follow redirects when downloading apache-maven-3.3.9-bin.tar.gz

Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020