
Conversation


@zou3519 zou3519 commented Jul 30, 2018

The design can be found here: https://fb.quip.com/JLkbAyYc2AFy#IZGACAsKsBe

Test Plan

  • Expect test and correctness test to check that a single chunk is
    fused by the graph fuser
  • Correctness tests for a variety of chunks (dim at the beginning,
    middle, or end) and tensors (contiguous, non-contiguous, and the
    splitSize = 1 edge case), on both CPU and CUDA
  • Expect test for multiple chunks fused into the same kernel, plus a
    correctness test
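For concreteness, here is a minimal sketch of the chunk configurations the plan above exercises, written against the public ATen API. This is illustrative only, not the actual test harness; the fused-kernel comparison itself happens inside the JIT tests.

```cpp
#include <ATen/ATen.h>
#include <vector>

int main() {
  at::Tensor t = at::randn({4, 6, 8});

  // Chunk dimension at the beginning, middle, and end.
  for (int64_t dim : {0, 1, 2}) {
    std::vector<at::Tensor> chunks = at::chunk(t, /*chunks=*/2, dim);
  }

  // Non-contiguous input: chunk a transposed view.
  at::Tensor nc = t.transpose(0, 2);
  auto nc_chunks = at::chunk(nc, /*chunks=*/2, /*dim=*/1);

  // Edge case: splitSize = 1 (chunks equals the size along dim).
  auto unit_chunks = at::chunk(t, /*chunks=*/4, /*dim=*/0);
  return 0;
}
```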

cc @zdevito @apaszke

I benchmarked an LSTM with numLayers = 1, seq_length = 100, and input_size = hidden_size = 512, with requires_grad=False everywhere. Script here: https://gist.github.com/zou3519/f0840f773f8835ee834b748148042e22.

Perf is very bad with requires_grad=True (both before & after my changes), so I am investigating that (#10032)

Before numbers (ms):

```
thnn    cudnn   jit
9.7147  6.6332  13.6807
```

After numbers (ms):

```
thnn    cudnn   jit
9.6166  6.6871  9.4633
```

@zou3519 zou3519 added the oncall: jit label Jul 30, 2018
zou3519 added 4 commits July 31, 2018 10:59
The design can be found here:
- (quip) https://fb.quip.com/JLkbAyYc2AFy#IZGACAsKsBe
- (gist) coming soon...

Test Plan

- Expect test and correctness test to check that a single chunk is
  fused by the graph fuser
- Correctness tests for a variety of chunks (dim at the beginning,
  middle, or end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case), on both CPU and CUDA
- Expect test for multiple chunks fused into the same kernel, plus a
  correctness test
If all outputs of at::chunk are inputs to a FusionGroup and "chunks"
and "dim" are both constants, then the at::chunk is moved to the
beginning of the FusionGroup.
The main changes are:
1) The compiler emits offset code to index into the input tensor to
produce the output values.
2) At the same time, the compiler injects splitSize and splitStride
args into the formals list.
3) The compiler stores a vector<ChunkDesc> in CompiledFusionFunction
that describes all of the chunk operations.
4) When launching a kernel, the compiler checks each ChunkDesc to
compute splitSize and splitStride and appends those to the list of
arguments (see the sketch below).
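For illustration, a minimal sketch of the bookkeeping in (3) and (4). The ChunkDesc fields and the appendChunkArgs helper are hypothetical, assuming each chunk starts splitSize * stride(dim) elements after the previous one:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Describes one at::chunk absorbed into the fusion group
// (field names here are illustrative, not the real ones).
struct ChunkDesc {
  size_t inputIndex;  // which fusion-group input gets chunked
  int64_t nChunks;    // the constant "chunks" argument
  int64_t dim;        // the constant "dim" argument
};

// Hypothetical launch-time helper: compute the extra scalar formals
// the generated kernel expects and append them to the argument list.
void appendChunkArgs(const ChunkDesc& desc,
                     const std::vector<int64_t>& sizes,
                     const std::vector<int64_t>& strides,
                     std::vector<int64_t>& args) {
  int64_t splitSize = sizes[desc.dim] / desc.nChunks;
  // Element offset between consecutive chunks along `dim`.
  int64_t splitStride = splitSize * strides[desc.dim];
  args.push_back(splitSize);
  args.push_back(splitStride);
}
```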

zou3519 commented Aug 2, 2018

Closing in favor of #10178

@zou3519 zou3519 closed this Aug 2, 2018
facebook-github-bot pushed a commit that referenced this pull request Aug 18, 2018
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026

This is done through the following:

1) Absorb starting chunks into the FusionGroup as part of the graph
fuser pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes whether each input (of the original graph) will be chunked.
3) When launching a kernel, use the `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunking takes an at::Tensor directly and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors (a rough sketch follows).
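As a rough sketch of step 3, assuming the same splitSize * stride(dim) offset per chunk as above; this TensorInfo layout (fixed max rank of 8) and the emitChunkArgs helper are simplified stand-ins, not the real fuser structs:

```cpp
#include <ATen/ATen.h>
#include <vector>

// Simplified stand-in for the fuser's argument struct (fixed max rank).
struct TensorInfo {
  void* data;
  int64_t sizes[8];
  int64_t strides[8];
};

// Hypothetical helper: write one TensorInfo per chunk straight into the
// argument list, bypassing the creation of intermediate Tensors.
void emitChunkArgs(const at::Tensor& input, int64_t nChunks, int64_t dim,
                   std::vector<TensorInfo>& args) {
  int64_t splitSize = input.size(dim) / nChunks;
  // Byte offset between consecutive chunks along `dim`.
  int64_t chunkBytes = splitSize * input.stride(dim) * input.element_size();
  for (int64_t i = 0; i < nChunks; ++i) {
    TensorInfo ti{};
    ti.data = static_cast<char*>(input.data_ptr()) + i * chunkBytes;
    for (int64_t d = 0; d < input.dim(); ++d) {
      ti.sizes[d] = (d == dim) ? splitSize : input.size(d);
      ti.strides[d] = input.stride(d);
    }
    args.push_back(ti);
  }
}
```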

- Expect test and correctness test to check that a single chunk is
  fused by the graph fuser
- Correctness tests for a variety of chunks (dim at the beginning,
  middle, or end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case), on both CPU and CUDA
- Expect test for multiple chunks fused into the same kernel, plus a
  correctness test

cc zdevito apaszke

LSTM forward pass, 1 layer, hidden size = input size = 512, sequence length = 100, requires_grad=False on all inputs and weights.

After changes:
```
thnn    cudnn   jit
8.8468  6.5797  9.3470
```

Before changes:
```
thnn    cudnn   jit
9.9221  6.6539  11.2550
```
Pull Request resolved: #10178

Differential Revision: D9382661

Pulled By: zou3519

fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018