Description
🐛 Bug
Some test_qr CUDA tests in test_autograd take a very long time to run.
To Reproduce
The environments are: Nvidia driver 450.51.06, CUDA 11.2, [conda] MKL 2019.1, [conda] magma-cuda110 2.5.2
Commit f4fc3e3
Steps to reproduce the behavior:
IN_CI=1 python test/test_autograd.py -v -k TestAutogradDeviceTypeCUDA.test_qr
DGX A100, AMD EPYC 7742 64-Core Processor
Running tests...
----------------------------------------------------------------------
test_qr_square_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (129.655s)
test_qr_square_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (30.834s)
test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (467.517s)
test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (123.556s)
test_qr_square_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (13.334s)
test_qr_square_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (4.016s)
test_qr_tall_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (66.231s)
test_qr_tall_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (16.792s)
test_qr_tall_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (258.459s)
test_qr_tall_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (65.790s)
test_qr_tall_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (7.513s)
test_qr_tall_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (2.209s)
test_qr_wide_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (67.090s)
test_qr_wide_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (17.168s)
test_qr_wide_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (259.270s)
test_qr_wide_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (66.223s)
test_qr_wide_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (7.477s)
test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (2.127s)
----------------------------------------------------------------------
Ran 18 tests in 1605.262s
OK
DGX V100, Intel(R) Xeon(R) CPU E5-2698 v4
Running tests...
----------------------------------------------------------------------
test_qr_square_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (75.711s)
test_qr_square_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (17.111s)
test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (255.971s)
test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (65.019s)
test_qr_square_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (7.962s)
test_qr_square_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (2.096s)
test_qr_tall_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (36.906s)
test_qr_tall_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (9.634s)
test_qr_tall_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (141.587s)
test_qr_tall_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (36.068s)
test_qr_tall_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (4.386s)
test_qr_tall_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (1.214s)
test_qr_wide_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (37.236s)
test_qr_wide_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (9.739s)
test_qr_wide_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (142.034s)
test_qr_wide_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (36.326s)
test_qr_wide_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (4.504s)
test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (1.220s)
----------------------------------------------------------------------
Ran 18 tests in 884.726s
OK
Circle CI (CPU unknown, possibly 16 threads), Nvidia driver 450.51.06, M60, MKL 2020.0.2, MAGMA 2.5.2
pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test, c77fc2e
Feb 02 05:45:06 test_qr_square_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (114.230s)
Feb 02 05:45:36 test_qr_square_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (29.400s)
Feb 02 05:52:54 test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (438.641s)
Feb 02 05:54:46 test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (111.259s)
Feb 02 05:55:00 test_qr_square_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (14.097s)
Feb 02 05:55:04 test_qr_square_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (3.838s)
Feb 02 05:56:07 test_qr_tall_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (63.712s)
Feb 02 05:56:24 test_qr_tall_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (16.879s)
Feb 02 06:00:32 test_qr_tall_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (247.478s)
Feb 02 06:01:35 test_qr_tall_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (63.282s)
Feb 02 06:01:43 test_qr_tall_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (8.116s)
Feb 02 06:01:45 test_qr_tall_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (2.291s)
Feb 02 06:02:51 test_qr_wide_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (65.269s)
Feb 02 06:03:07 test_qr_wide_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (16.856s)
Feb 02 06:07:12 test_qr_wide_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (244.314s)
Feb 02 06:08:14 test_qr_wide_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (62.452s)
Feb 02 06:08:22 test_qr_wide_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (8.055s)
Feb 02 06:08:25 test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (2.285s)
As a rough estimate, these tests take ~25 minutes in total on Circle CI.
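The per-test timings in a log like the ones above can be totalled with a short script. This is only a minimal sketch, not part of the PyTorch test suite: the helper names `parse_durations` and `summarize` are hypothetical, and the regular expression assumes the unittest line format shown in the logs above (a test name, the class in parentheses, then `... ok (<seconds>s)`).

```python
import re

# Assumed line format, taken from the logs in this report, e.g.:
#   "test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (2.285s)"
# IGNORECASE lets the same pattern match both "ok" and "OK".
LINE_RE = re.compile(
    r"(test_\w+) \(.*?\) \.\.\. ok \((\d+(?:\.\d+)?)s\)", re.IGNORECASE
)


def parse_durations(log_text):
    """Return {test_name: seconds} for every passing test line in log_text."""
    return {m.group(1): float(m.group(2)) for m in LINE_RE.finditer(log_text)}


def summarize(log_text, top=3):
    """Return (total_seconds, slowest) where slowest is a list of (name, seconds)."""
    durations = parse_durations(log_text)
    total = sum(durations.values())
    slowest = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return total, slowest


# Three lines copied from the Circle CI log above, used as sample input.
sample = """\
Feb 02 05:52:54 test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (438.641s)
Feb 02 05:54:46 test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (111.259s)
Feb 02 06:08:25 test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (2.285s)
"""

total, slowest = summarize(sample)
print(f"total: {total:.1f}s ({total / 60:.1f} min)")
for name, secs in slowest:
    print(f"{secs:9.3f}s  {name}")
```

Feeding the full Circle CI log to `summarize` instead of the three-line sample gives the overall figure and ranks the `many_batched_complex` variants as the slowest.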
Expected behavior
The qr tests should run much faster.
Environment
See above
Additional context
Also, we are seeing some MAGMA failures while running test_qr... with a new driver:
test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ...
python: /opt/conda/conda-bld/magma-cuda110_1611787720491/work/interface_cuda/interface.cpp:806:
void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int):
Assertion `queue->dAarray__ != NULL' failed.
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @anjali411 @dylanbespalko @mruberry @ngimel @jianyuh @heitorschueroff @walterddr @IvanYashchuk @VitalyFedyunin @ptrblck