This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@leezu (Contributor) commented Feb 14, 2020

Description

cublasGemmBatchedEx is only supported on GPUs with compute capability equal to or greater than 5.0.

Fixes a bug in #16408

Changes

  • Fix transformer.cu interleaved matmul for CUDA arch < 5

Comments

CC @Caenorst

@leezu leezu requested a review from ptrendx February 14, 2020 20:30
@access2rohit (Contributor) left a comment


LGTM!

@leezu (Contributor, Author) commented Feb 15, 2020

Verified this patch by finetuning BERT on a P2 instance.

Verification was initially blocked / delayed by #17576 ...

% python finetune_classifier.py --task_name RTE --batch_size 32 --epochs 3 --gpu 0 --lr 2e-5
INFO:root:01:21:10 Namespace(accumulate=None, batch_size=32, bert_dataset='book_corpus_wiki_en_uncased', bert_model='bert_12_768_12', calib_mode='customize', deploy=False, dev_batch_size=8, dtype='float32', early_stop=None, epochs=3, epsilon=1e-06, gpu=0, log_interval=10, lr=2e-05, max_len=128, model_parameters=None, model_prefix=None, num_calib_batches=5, only_calibration=False, only_inference=False, optimizer='bertadam', output_dir='./output_dir', pretrained_bert_parameters=None, quantized_dtype='auto', round_to=None, seed=2, task_name='RTE', training_steps=None, warmup_ratio=0.1)
[01:21:12] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:root:01:21:26 processing dataset...
INFO:root:01:21:35 Now we are doing BERT classification training on gpu(0)!
INFO:root:01:21:35 training steps=233
INFO:root:01:21:45 [Epoch 1 Batch 10/82] loss=0.7479, lr=0.0000078, metrics:accuracy:0.5507
INFO:root:01:21:54 [Epoch 1 Batch 20/82] loss=0.7263, lr=0.0000165, metrics:accuracy:0.5235
INFO:root:01:22:02 [Epoch 1 Batch 30/82] loss=0.6821, lr=0.0000194, metrics:accuracy:0.5306
INFO:root:01:22:12 [Epoch 1 Batch 40/82] loss=0.6718, lr=0.0000185, metrics:accuracy:0.5370
INFO:root:01:22:21 [Epoch 1 Batch 50/82] loss=0.6743, lr=0.0000175, metrics:accuracy:0.5518
INFO:root:01:22:31 [Epoch 1 Batch 60/82] loss=0.6894, lr=0.0000166, metrics:accuracy:0.5551
INFO:root:01:22:39 [Epoch 1 Batch 70/82] loss=0.6872, lr=0.0000156, metrics:accuracy:0.5587
INFO:root:01:22:48 [Epoch 1 Batch 80/82] loss=0.6626, lr=0.0000147, metrics:accuracy:0.5693
INFO:root:01:22:50 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:22:51 [Batch 10/35] loss=0.6449, metrics:accuracy:0.6750
INFO:root:01:22:52 [Batch 20/35] loss=0.6266, metrics:accuracy:0.6813
INFO:root:01:22:54 [Batch 30/35] loss=0.6930, metrics:accuracy:0.6625
INFO:root:01:22:54 validation metrics:accuracy:0.6715
INFO:root:01:22:54 Time cost=4.00s, throughput=69.97 samples/s
INFO:root:01:22:55 params saved in: ./output_dir/model_bert_RTE_0.params
INFO:root:01:22:55 Time cost=79.30s
INFO:root:01:23:03 [Epoch 2 Batch 10/82] loss=0.5310, lr=0.0000135, metrics:accuracy:0.7719
INFO:root:01:23:12 [Epoch 2 Batch 20/82] loss=0.5022, lr=0.0000126, metrics:accuracy:0.7650
INFO:root:01:23:22 [Epoch 2 Batch 30/82] loss=0.4835, lr=0.0000116, metrics:accuracy:0.7733
INFO:root:01:23:31 [Epoch 2 Batch 40/82] loss=0.4762, lr=0.0000107, metrics:accuracy:0.7754
INFO:root:01:23:40 [Epoch 2 Batch 50/82] loss=0.4412, lr=0.0000097, metrics:accuracy:0.7728
INFO:root:01:23:48 [Epoch 2 Batch 60/82] loss=0.4915, lr=0.0000088, metrics:accuracy:0.7741
INFO:root:01:23:57 [Epoch 2 Batch 70/82] loss=0.4512, lr=0.0000078, metrics:accuracy:0.7767
INFO:root:01:24:05 [Epoch 2 Batch 80/82] loss=0.3897, lr=0.0000069, metrics:accuracy:0.7832
INFO:root:01:24:06 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:24:08 [Batch 10/35] loss=0.6482, metrics:accuracy:0.7125
INFO:root:01:24:09 [Batch 20/35] loss=0.6311, metrics:accuracy:0.7125
INFO:root:01:24:10 [Batch 30/35] loss=0.7034, metrics:accuracy:0.7042
INFO:root:01:24:10 validation metrics:accuracy:0.7076
INFO:root:01:24:10 Time cost=4.00s, throughput=70.06 samples/s
INFO:root:01:24:11 params saved in: ./output_dir/model_bert_RTE_1.params
INFO:root:01:24:11 Time cost=76.11s
INFO:root:01:24:21 [Epoch 3 Batch 10/82] loss=0.2911, lr=0.0000057, metrics:accuracy:0.9125
INFO:root:01:24:30 [Epoch 3 Batch 20/82] loss=0.2762, lr=0.0000048, metrics:accuracy:0.9092
INFO:root:01:24:39 [Epoch 3 Batch 30/82] loss=0.2438, lr=0.0000038, metrics:accuracy:0.9121
INFO:root:01:24:47 [Epoch 3 Batch 40/82] loss=0.2719, lr=0.0000029, metrics:accuracy:0.9077
INFO:root:01:24:56 [Epoch 3 Batch 50/82] loss=0.2787, lr=0.0000019, metrics:accuracy:0.9054
INFO:root:01:25:05 [Epoch 3 Batch 60/82] loss=0.3279, lr=0.0000010, metrics:accuracy:0.9049
INFO:root:01:25:12 Finish training step: 233
INFO:root:01:25:12 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:25:14 [Batch 10/35] loss=0.7463, metrics:accuracy:0.7125
INFO:root:01:25:15 [Batch 20/35] loss=0.6660, metrics:accuracy:0.7250
INFO:root:01:25:16 [Batch 30/35] loss=0.7802, metrics:accuracy:0.7125
INFO:root:01:25:16 validation metrics:accuracy:0.7112
INFO:root:01:25:16 Time cost=3.97s, throughput=70.60 samples/s
INFO:root:01:25:17 params saved in: ./output_dir/model_bert_RTE_2.params
INFO:root:01:25:17 Time cost=65.91s
INFO:root:01:25:17 Best model at epoch 2. Validation metrics:accuracy:0.7112
INFO:root:01:25:17 Now we are doing testing on test with gpu(0).
INFO:root:01:25:54 Time cost=36.38s, throughput=82.47 samples/s

@leezu leezu merged commit d352673 into apache:master Feb 15, 2020
@leezu leezu deleted the fixtransformercu branch February 15, 2020 06:00
leezu added a commit to leezu/mxnet that referenced this pull request Feb 15, 2020
@leezu leezu mentioned this pull request Feb 15, 2020
leezu added a commit that referenced this pull request Feb 17, 2020
* Fix transformer.cu interleaved matmul for cuda arch < 5  (#17596)

cublasGemmBatchedEx is only supported on GPUs with compute capability equal to or greater than 5.0.

Fixes a bug in #16408

* pin Markdown version to 3.1 in Julia doc build (#17549)

* pin Sphinx due to autodocsumm issue with v4.2.0 (#17561)

* pin python dependencies (#17556)

* [CI] Fix static build pipeline (#17474)

* 1.5.x CI fixes (#17426)

* Fix numpy decorator

* Workaround pytest-dev/pytest#5903

* Disable pylint warnings

* Fix Edge build

* Fix numpy decorator on Centos

* Follow redirects when downloading apache-maven-3.3.9-bin.tar.gz

Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020