
Conversation


@zou3519 zou3519 commented Jul 30, 2018

The design can be found here: https://fb.quip.com/JLkbAyYc2AFy#IZGACAsKsBe

Test Plan

  • Expect test and correctness test to check that a single chunk is
    fused by the graph fuser
  • Correctness tests for a variety of chunks (dim at the beginning,
    middle, or end) and tensors (contiguous, non-contiguous, and the
    splitSize = 1 edge case), on both CPU and CUDA
  • Expect test for multiple chunks fused into the same kernel, plus a
    correctness test
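For concreteness, here is a minimal sketch of the chunk configurations the plan above exercises, written against the public ATen API. This is illustrative only, not the actual test harness; the fused-kernel comparison itself happens inside the JIT tests.

```cpp
#include <ATen/ATen.h>
#include <vector>

int main() {
  at::Tensor t = at::randn({4, 6, 8});

  // Chunk dimension at the beginning, middle, and end.
  for (int64_t dim : {0, 1, 2}) {
    std::vector<at::Tensor> chunks = at::chunk(t, /*chunks=*/2, dim);
  }

  // Non-contiguous input: chunk a transposed view.
  at::Tensor nc = t.transpose(0, 2);
  auto nc_chunks = at::chunk(nc, /*chunks=*/2, /*dim=*/1);

  // Edge case: splitSize = 1 (chunks equals the size along dim).
  auto unit_chunks = at::chunk(t, /*chunks=*/4, /*dim=*/0);
  return 0;
}
```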

cc @zdevito @apaszke

I benchmarked an LSTM with numLayers = 1, seq_length = 100, and input_size = hidden_size = 512, with requires_grad=False everywhere. Script here: https://gist.github.com/zou3519/f0840f773f8835ee834b748148042e22.

Perf is very bad with requires_grad=True (both before & after my changes), so I am investigating that (#10032)

Before numbers (ms):

```
thnn    cudnn   jit
9.7147  6.6332  13.6807
```

After numbers (ms):

```
thnn    cudnn   jit
9.6166  6.6871  9.4633
```

@zou3519 zou3519 added the oncall: jit label Jul 30, 2018
zou3519 added 4 commits July 31, 2018 10:59
The design can be found here:
- (quip) https://fb.quip.com/JLkbAyYc2AFy#IZGACAsKsBe
- (gist) coming soon...

Test Plan

- Expect test and correctness test to check that a single chunk is
  fused by the graph fuser
- Correctness tests for a variety of chunks (dim at the beginning,
  middle, or end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case), on both CPU and CUDA
- Expect test for multiple chunks fused into the same kernel, plus a
  correctness test
If all outputs of at::chunk are inputs to a FusionGroup and "chunks"
and "dim" are both constants, then the at::chunk is moved to the
beginning of the FusionGroup.
The main changes are:
1) The compiler emits offset code to index into the input tensor to
produce the output values.
2) At the same time, the compiler injects splitSize and splitStride
args into the formals list.
3) The compiler stores a vector<ChunkDesc> in CompiledFusionFunction
that describes all of the chunk operations.
4) When launching a kernel, the compiler checks each ChunkDesc to
compute splitSize and splitStride and appends those to the list of
arguments (see the sketch below).
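For illustration, a minimal sketch of the bookkeeping in (3) and (4). The ChunkDesc fields and the appendChunkArgs helper are hypothetical, assuming each chunk starts splitSize * stride(dim) elements after the previous one:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Describes one at::chunk absorbed into the fusion group
// (field names here are illustrative, not the real ones).
struct ChunkDesc {
  size_t inputIndex;  // which fusion-group input gets chunked
  int64_t nChunks;    // the constant "chunks" argument
  int64_t dim;        // the constant "dim" argument
};

// Hypothetical launch-time helper: compute the extra scalar formals
// the generated kernel expects and append them to the argument list.
void appendChunkArgs(const ChunkDesc& desc,
                     const std::vector<int64_t>& sizes,
                     const std::vector<int64_t>& strides,
                     std::vector<int64_t>& args) {
  int64_t splitSize = sizes[desc.dim] / desc.nChunks;
  // Element offset between consecutive chunks along `dim`.
  int64_t splitStride = splitSize * strides[desc.dim];
  args.push_back(splitSize);
  args.push_back(splitStride);
}
```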

zou3519 commented Aug 2, 2018

Closing in favor of #10178

@zou3519 zou3519 closed this Aug 2, 2018
facebook-github-bot pushed a commit that referenced this pull request Aug 18, 2018
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026

This is done through the following:

1) Absorb starting chunks into the FusionGroup as part of the graph
fuser pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes whether each input (of the original graph) will be chunked.
3) When launching a kernel, use the `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunking takes an at::Tensor directly and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors (a rough sketch follows).
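As a rough sketch of step 3, assuming the same splitSize * stride(dim) offset per chunk as above; this TensorInfo layout (fixed max rank of 8) and the emitChunkArgs helper are simplified stand-ins, not the real fuser structs:

```cpp
#include <ATen/ATen.h>
#include <vector>

// Simplified stand-in for the fuser's argument struct (fixed max rank).
struct TensorInfo {
  void* data;
  int64_t sizes[8];
  int64_t strides[8];
};

// Hypothetical helper: write one TensorInfo per chunk straight into the
// argument list, bypassing the creation of intermediate Tensors.
void emitChunkArgs(const at::Tensor& input, int64_t nChunks, int64_t dim,
                   std::vector<TensorInfo>& args) {
  int64_t splitSize = input.size(dim) / nChunks;
  // Byte offset between consecutive chunks along `dim`.
  int64_t chunkBytes = splitSize * input.stride(dim) * input.element_size();
  for (int64_t i = 0; i < nChunks; ++i) {
    TensorInfo ti{};
    ti.data = static_cast<char*>(input.data_ptr()) + i * chunkBytes;
    for (int64_t d = 0; d < input.dim(); ++d) {
      ti.sizes[d] = (d == dim) ? splitSize : input.size(d);
      ti.strides[d] = input.stride(d);
    }
    args.push_back(ti);
  }
}
```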

- Expect test and correctness test to check that a single chunk is
  fused by the graph fuser
- Correctness tests for a variety of chunks (dim at the beginning,
  middle, or end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case), on both CPU and CUDA
- Expect test for multiple chunks fused into the same kernel, plus a
  correctness test

cc zdevito apaszke

LSTM forward pass, 1 layer, hidden size = input size = 512, sequence length = 100, requires_grad=False on all inputs and weights.

After changes:
```
thnn    cudnn   jit
8.8468  6.5797  9.3470
```

Before changes:
```
thnn    cudnn   jit
9.9221  6.6539  11.2550
```
Pull Request resolved: #10178

Differential Revision: D9382661

Pulled By: zou3519

fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018