[DO NOT REVIEW] Channels last perf test #25102
Conversation
Added cudnn nhwc support for: 1. batch norm 2. convolution 3. convolution_transpose
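For context, a minimal sketch (shapes and dtypes are arbitrary and this is not the PR's benchmark script; it assumes a CUDA build with cuDNN) of the kind of call the NHWC convolution path is meant to serve:

```python
import torch
import torch.nn as nn

# Illustrative only: run a convolution with channels_last (NHWC-strided)
# input and weights so the cuDNN NHWC path can be exercised.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1).cuda().half()
conv = conv.to(memory_format=torch.channels_last)

x = torch.randn(32, 64, 56, 56, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)

out = conv(x)
# With a channels_last input, the output typically keeps NHWC strides.
print(out.is_contiguous(memory_format=torch.channels_last))
```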
`suggest_memory_format` has an ambiguous meaning in two cases: 1. a tensor in NCHW where C == 1, where we could use the stride of C as a hint to the intended memory format; 2. a tensor in NCHW where H == W == 1, where there is no way to identify the intended memory format from the strides. Currently we fall back to NCHW whenever we see a contiguous tensor, which avoids the ambiguity in some of these special cases.
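A small sketch of the two ambiguous shapes described above (sizes are illustrative; the strides in the comments are what stock PyTorch produces for them):

```python
import torch

# Case 1: C == 1. Both layouts describe the same dense memory, but the
# stride of the channel dim differs and can serve as a hint to the intent.
a = torch.empty(2, 1, 4, 4)                                      # NCHW strides
b = torch.empty(2, 1, 4, 4, memory_format=torch.channels_last)   # NHWC strides
print(a.stride(), b.stride())   # (16, 16, 4, 1) vs (16, 1, 4, 1)

# Case 2: H == W == 1. The channel stride is 1 in both layouts, so the
# hint from case 1 no longer distinguishes them.
c = torch.empty(2, 8, 1, 1)
d = torch.empty(2, 8, 1, 1, memory_format=torch.channels_last)
print(c.stride(), d.stride())   # (8, 1, 1, 1) vs (8, 1, 8, 8)
```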
The old implementation assumed `is_channels_last_contiguous_` to be mutually exclusive with `is_contiguous_`, which is not true. Properly set the flag by checking strides.
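For illustration, a tensor that should satisfy both contiguity checks at once (arbitrary shape; any 4-D tensor whose spatial dims have size 1 behaves this way):

```python
import torch

# A dense tensor can be contiguous in both the NCHW and the channels_last
# sense at the same time, e.g. when H == W == 1.
x = torch.empty(2, 8, 1, 1)
print(x.is_contiguous())                                    # expected: True
print(x.is_contiguous(memory_format=torch.channels_last))   # expected: True
```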
…e_empty_supports_memory_format
Initial kernel support added for the optimized NHWC tensor. TODO: the backwards kernel currently spits out a tensor with NHWC strides. Unfortunately autograd restores the grad to contiguous (in either the copy or the add), which makes real perf tuning annoying to do, since I cannot easily measure end-to-end time in my Python script. My current kernel is blazing fast compared to the original NCHW kernel in fp16, since I avoided atomicAdd. I'll finish perf tuning after we merge future PRs expanding NHWC support in the core.
This reverts commit c7ece81.
@jjsjann123 @ifedan FYI
BTW, see pytorch/torch/csrc/autograd/functions/accumulate_grad.cpp, lines 42 to 59 (at fc7f4e4):
This clones the grad if it's not contiguous. In the benchmarking for my PR, that permutation kills the perf gain from NHWC :/ Not sure how it works out now as our
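To make the concern concrete, here is a rough, illustrative timing of just the clone-to-NCHW-contiguous step on a channels_last gradient (arbitrary shape, not the PR's benchmark; assumes a CUDA device):

```python
import torch

# Illustrative: the extra permutation copy that forcing an NHWC gradient
# back to NCHW-contiguous adds on top of the NHWC backward kernel itself.
grad = torch.randn(64, 128, 56, 56, device="cuda", dtype=torch.half)
grad = grad.to(memory_format=torch.channels_last)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
nchw_grad = grad.contiguous()        # forces a full NCHW copy
end.record()
torch.cuda.synchronize()
print(f"permutation copy: {start.elapsed_time(end):.3f} ms")
```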
The previous kernel did not stride over the channel dimension, and it uses shared memory to store temporary results (to break a data dependency and expose more parallelism). That resulted in requesting more resources than are available. Fix: added striding over C to reduce shared-memory usage per CTA.
Updated the cuDNN API for batch norm, enabling the extended API which provides a semi-persistent batch-norm kernel with better performance on the NHWC layout. TODO: I made adjustments to the API as well as to BN in the JIT IR, but I haven't fully tested the JIT part yet. I should verify that in the final PR.
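As a hedged usage sketch (not the PR's test; shapes are arbitrary and a CUDA build with cuDNN is assumed), this is the kind of workload the NHWC batch-norm path is aimed at: BatchNorm2d over a channels_last fp16 input routed through cuDNN:

```python
import torch
import torch.nn as nn

# Illustrative: BatchNorm2d on a channels_last fp16 input so the cuDNN
# NHWC batch-norm path can be exercised.
bn = nn.BatchNorm2d(128).cuda()
x = torch.randn(64, 128, 56, 56, device="cuda", dtype=torch.half)
x = x.to(memory_format=torch.channels_last)

with torch.backends.cudnn.flags(enabled=True, benchmark=True):
    out = bn(x)

print(out.dtype, out.is_contiguous(memory_format=torch.channels_last))
```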
This PR combines the changes of the #23403 subtasks.