Some test_qr CUDA tests in test_autograd are taking very long time #51552

@xwang233

🐛 Bug

Some test_qr CUDA tests in test_autograd take a very long time to run.

To Reproduce

Environment: NVIDIA driver 450.51.06, CUDA 11.2, [conda] MKL 2019.1, [conda] magma-cuda110 2.5.2

Commit f4fc3e3

Steps to reproduce the behavior:

IN_CI=1 python test/test_autograd.py -v -k TestAutogradDeviceTypeCUDA.test_qr

DGX A100, AMD EPYC 7742 64-Core Processor

Running tests...
----------------------------------------------------------------------
  test_qr_square_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (129.655s)
  test_qr_square_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (30.834s)
  test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (467.517s)
  test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (123.556s)
  test_qr_square_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (13.334s)
  test_qr_square_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (4.016s)
  test_qr_tall_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (66.231s)
  test_qr_tall_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (16.792s)
  test_qr_tall_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (258.459s)
  test_qr_tall_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (65.790s)
  test_qr_tall_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (7.513s)
  test_qr_tall_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (2.209s)
  test_qr_wide_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (67.090s)
  test_qr_wide_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (17.168s)
  test_qr_wide_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (259.270s)
  test_qr_wide_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (66.223s)
  test_qr_wide_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (7.477s)
  test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (2.127s)

----------------------------------------------------------------------
Ran 18 tests in 1605.262s

OK

DGX V100, Intel(R) Xeon(R) CPU E5-2698 v4

Running tests...
----------------------------------------------------------------------
  test_qr_square_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (75.711s)
  test_qr_square_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (17.111s)
  test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (255.971s)
  test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (65.019s)
  test_qr_square_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (7.962s)
  test_qr_square_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (2.096s)
  test_qr_tall_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (36.906s)
  test_qr_tall_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (9.634s)
  test_qr_tall_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (141.587s)
  test_qr_tall_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (36.068s)
  test_qr_tall_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (4.386s)
  test_qr_tall_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (1.214s)
  test_qr_wide_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (37.236s)
  test_qr_wide_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (9.739s)
  test_qr_wide_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (142.034s)
  test_qr_wide_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (36.326s)
  test_qr_wide_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (4.504s)
  test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... OK (1.220s)

----------------------------------------------------------------------
Ran 18 tests in 884.726s

OK

Circle CI: CPU unknown (possibly 16 threads), NVIDIA driver 450.51.06, M60 GPU, MKL 2020.0.2, MAGMA 2.5.2

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test, c77fc2e

https://circleci.com/api/v1.1/project/github/pytorch/pytorch/10591651/output/107/0?file=true&allocation-id=6018c9b4876001051c9a907b-0-build%2F94F4739

Feb 02 05:45:06   test_qr_square_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (114.230s)
Feb 02 05:45:36   test_qr_square_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (29.400s)
Feb 02 05:52:54   test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (438.641s)
Feb 02 05:54:46   test_qr_square_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (111.259s)
Feb 02 05:55:00   test_qr_square_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (14.097s)
Feb 02 05:55:04   test_qr_square_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (3.838s)

Feb 02 05:56:07   test_qr_tall_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (63.712s)
Feb 02 05:56:24   test_qr_tall_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (16.879s)
Feb 02 06:00:32   test_qr_tall_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (247.478s)
Feb 02 06:01:35   test_qr_tall_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (63.282s)
Feb 02 06:01:43   test_qr_tall_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (8.116s)
Feb 02 06:01:45   test_qr_tall_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (2.291s)

Feb 02 06:02:51   test_qr_wide_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (65.269s)
Feb 02 06:03:07   test_qr_wide_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (16.856s)
Feb 02 06:07:12   test_qr_wide_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (244.314s)
Feb 02 06:08:14   test_qr_wide_many_batched_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (62.452s)
Feb 02 06:08:22   test_qr_wide_single_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (8.055s)
Feb 02 06:08:25   test_qr_wide_single_cuda (__main__.TestAutogradDeviceTypeCUDA) ... ok (2.285s)

As a rough estimate, these 18 tests take ~25 minutes on Circle CI.
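The ~25 minute figure can be checked by summing the per-test durations from the Circle CI log. The following is a minimal sketch, not part of the original report: the log lines are copied from above with timestamps and the class name dropped, and the names `LOG`, `LINE_RE`, and `parse_durations` are made up for illustration.

```python
import re

# Circle CI durations copied from the log above (timestamps and class name dropped).
LOG = """
test_qr_square_batched_complex_cuda ... ok (114.230s)
test_qr_square_batched_cuda ... ok (29.400s)
test_qr_square_many_batched_complex_cuda ... ok (438.641s)
test_qr_square_many_batched_cuda ... ok (111.259s)
test_qr_square_single_complex_cuda ... ok (14.097s)
test_qr_square_single_cuda ... ok (3.838s)
test_qr_tall_batched_complex_cuda ... ok (63.712s)
test_qr_tall_batched_cuda ... ok (16.879s)
test_qr_tall_many_batched_complex_cuda ... ok (247.478s)
test_qr_tall_many_batched_cuda ... ok (63.282s)
test_qr_tall_single_complex_cuda ... ok (8.116s)
test_qr_tall_single_cuda ... ok (2.291s)
test_qr_wide_batched_complex_cuda ... ok (65.269s)
test_qr_wide_batched_cuda ... ok (16.856s)
test_qr_wide_many_batched_complex_cuda ... ok (244.314s)
test_qr_wide_many_batched_cuda ... ok (62.452s)
test_qr_wide_single_complex_cuda ... ok (8.055s)
test_qr_wide_single_cuda ... ok (2.285s)
"""

# Captures the test name and the duration in seconds from "name ... ok (1.234s)".
LINE_RE = re.compile(r"(test_\w+).*?\((\d+\.\d+)s\)")


def parse_durations(log):
    """Return a {test_name: seconds} dict for every timed test line in the log."""
    return {name: float(sec) for name, sec in LINE_RE.findall(log)}


durations = parse_durations(LOG)
total = sum(durations.values())
complex_total = sum(s for n, s in durations.items() if "_complex_" in n)
real_total = total - complex_total

print(f"{len(durations)} tests, {total:.3f} s total ({total / 60:.1f} min)")
print(f"complex: {complex_total:.3f} s, real: {real_total:.3f} s")

# List the three slowest tests.
slowest = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, sec in slowest:
    print(f"{sec:9.3f} s  {name}")
```

The total comes to about 1512 s, or roughly 25.2 minutes. Splitting by the `_complex_` suffix also shows that the complex variants account for most of that time.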

Expected behavior

The qr tests should run much faster.

Environment

See above

Additional context

Also, we are seeing some MAGMA failures while running test_qr... with a new driver:

test_qr_square_many_batched_complex_cuda (__main__.TestAutogradDeviceTypeCUDA) ... 
python: /opt/conda/conda-bld/magma-cuda110_1611787720491/work/interface_cuda/interface.cpp:806: 
void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): 
Assertion `queue->dAarray__ != NULL' failed.

xref #42666 #47953

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @anjali411 @dylanbespalko @mruberry @ngimel @jianyuh @heitorschueroff @walterddr @IvanYashchuk @VitalyFedyunin @ptrblck

Metadata

    Labels

    module: autograd (Related to torch.autograd, and the autograd engine in general)
    module: complex (Related to complex number support in PyTorch)
    module: cuda (Related to torch.cuda, and CUDA support in general)
    module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul)
    module: magma (related to magma linear algebra cuda support)
    module: performance (Issues related to performance, either of kernel code or framework glue)
    module: tests (Issues related to tests (not the torch.testing module))
    triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
