Commit e263dd3
(#24396)
Summary:
Adds initial kernel support for optimized NHWC tensors.
TODO: the backward kernel currently outputs a tensor with NHWC strides.
Unfortunately, autograd restores the grad to contiguous (in either copy or add). This
makes real perf tuning annoying, since I cannot easily measure end-to-end
time in my Python script.
My current kernel is blazing fast compared to the original NCHW kernel in fp16,
since it avoids atomicAdd. I'll finish perf tuning after we merge a future
PR expanding NHWC support in the core.
Pull Request resolved: #24396
Differential Revision: D18115941
Pulled By: VitalyFedyunin
fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834
1 parent 2020cc0 commit e263dd3
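To make the "NHWC stride" wording in the summary concrete, here is a minimal sketch in plain Python of how element strides differ between a contiguous NCHW layout and an NHWC layout for the same logical (N, C, H, W) shape. The helper names `nchw_strides`, `nhwc_strides`, and `offset` are hypothetical and for illustration only; they are not part of this commit or of ATen.

```python
# Illustrative only: these helpers are hypothetical, not code from this commit.

def nchw_strides(n, c, h, w):
    """Element strides of a contiguous NCHW tensor, in (N, C, H, W) order."""
    return (c * h * w, h * w, w, 1)


def nhwc_strides(n, c, h, w):
    """Strides, still reported in (N, C, H, W) order, when memory is laid out NHWC."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Linear memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)
print(nchw_strides(*shape))  # (60, 20, 5, 1)
print(nhwc_strides(*shape))  # (60, 1, 15, 3)

# In NHWC the channel dimension is innermost: stepping one channel moves
# one element in memory, while stepping one column in W moves C elements.
print(offset((0, 1, 0, 0), nhwc_strides(*shape)))  # 1
print(offset((0, 0, 0, 1), nhwc_strides(*shape)))  # 3
```

A gradient that comes back with the first stride pattern when the kernel produced the second has been made contiguous again, which appears to be the autograd behavior the TODO refers to.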
File tree
4 files changed, +505 -79 lines changed
- aten/src/ATen/native
- cuda
- test