Move at::chunk into the graph fuser #10026
Closed
Conversation
The design can be found here:
- (quip) https://fb.quip.com/JLkbAyYc2AFy#IZGACAsKsBe
- (gist) coming soon...

Test Plan
- Expect test and correctness test to see whether a single chunk is fused by the graph fuser
- Correctness tests for a variety of chunks (dimension = beginning, middle, end) and tensors (contiguous, non-contiguous, edge case (splitSize = 1)) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel, plus a correctness test.
If all outputs of at::chunk are inputs to a FusionGroup, and "chunks" and "dim" are both constants, then the at::chunk is moved to the beginning of the FusionGroup.
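To make that precondition concrete, here is a minimal, self-contained sketch. The `Node` and `Use` structs below are toy stand-ins for the JIT IR types, and `canMoveChunkIntoFusionGroup` is a hypothetical helper, not the actual pass's API; it only illustrates the rule: "chunks" and "dim" must be compile-time constants, and every use of every chunk output must be the target FusionGroup.

```
// Toy model of the fusion precondition; not the torch::jit API.
#include <vector>

struct Node;

struct Use {
  Node* user;  // the node consuming a value
};

struct Node {
  bool chunksIsConstant = false;             // is "chunks" a constant?
  bool dimIsConstant = false;                // is "dim" a constant?
  std::vector<std::vector<Use>> outputUses;  // uses of each output value
};

bool canMoveChunkIntoFusionGroup(const Node& chunk, const Node* fusionGroup) {
  // Both arguments must be constants so the fused kernel can be
  // specialized with fixed offset code.
  if (!chunk.chunksIsConstant || !chunk.dimIsConstant)
    return false;
  // Every consumer of every chunk output must be the fusion group itself.
  for (const auto& uses : chunk.outputUses)
    for (const Use& use : uses)
      if (use.user != fusionGroup)
        return false;
  return true;
}
```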
The main changes are:
1) The compiler emits offset code that indexes into the input tensor to produce the output values.
2) At the same time, the compiler injects splitSize and splitStride arguments into the formals list.
3) The compiler stores a vector<ChunkDesc> in CompiledFusionFunction that describes all of the chunk operations.
4) When launching a kernel, the compiler checks each ChunkDesc to compute splitSize and splitStride and appends those to the list of arguments.
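As a rough illustration of steps 3) and 4), the sketch below shows what a ChunkDesc might record and how splitSize and splitStride could be derived at launch time. The field and function names are illustrative assumptions, not the actual CompiledFusionFunction internals:

```
// Illustrative sketch; names are assumptions, not the real fuser code.
#include <cstdint>
#include <vector>

struct ChunkDesc {
  int64_t nChunks;    // the constant "chunks" argument
  int64_t dim;        // the constant "dim" argument
  size_t inputIndex;  // which kernel input this chunk applies to
};

void appendChunkArgs(const std::vector<int64_t>& sizes,
                     const std::vector<int64_t>& strides,
                     const ChunkDesc& desc,
                     std::vector<int64_t>& kernelArgs) {
  // Each chunk covers sizes[dim] / nChunks elements along `dim`
  // (splitSize = 1, from the test plan, is the smallest such split).
  int64_t splitSize = sizes[desc.dim] / desc.nChunks;
  // Consecutive chunks start splitSize * strides[dim] elements apart,
  // which is all the kernel's offset code needs to index each output.
  int64_t splitStride = splitSize * strides[desc.dim];
  kernelArgs.push_back(splitSize);
  kernelArgs.push_back(splitStride);
}
```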
Force-pushed from bd9f67f to f7075dc.
Closing in favor of #10178
facebook-github-bot pushed a commit that referenced this pull request on Aug 18, 2018:
Summary: ... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026.

This is done through the following:
1) Absorb starting chunks into the FusionGroup as part of the graph fuser pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes whether an input (of the original graph) will be chunked.
3) When launching a kernel, use the `std::vector<ConcatDesc>` to chunk an input tensor on the CPU. This chunk directly takes in an at::Tensor and creates four TensorInfo structs in place in the argument list, bypassing the creation of intermediate Tensors.

Test plan:
- Expect test and correctness test to see whether a single chunk is fused by the graph fuser
- Correctness tests for a variety of chunks (dimension = beginning, middle, end) and tensors (contiguous, non-contiguous, edge case (splitSize = 1)) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel, plus a correctness test.

cc zdevito apaszke

Benchmark: LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.

After changes:
```
thnn    cudnn   jit
8.8468  6.5797  9.3470
```
Before changes:
```
thnn    cudnn   jit
9.9221  6.6539  11.2550
```

Pull Request resolved: #10178
Differential Revision: D9382661
Pulled By: zou3519
fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
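For intuition, here is a hedged, self-contained sketch of the "chunk without intermediate Tensors" idea from point 3) of the commit message. `TensorInfoLike` and `emitChunkArgs` are illustrative stand-ins for the fuser's actual TensorInfo struct and launch code: one descriptor per chunk is written directly into the kernel argument list, so no intermediate at::Tensor objects are ever allocated.

```
// Illustrative stand-ins; not the fuser's actual structs or functions.
#include <cstdint>
#include <vector>

constexpr int kMaxDims = 8;  // fixed maximum rank, for illustration only

struct TensorInfoLike {
  char* data;
  int64_t sizes[kMaxDims];
  int64_t strides[kMaxDims];
};

// Assumes sizes.size() <= kMaxDims and that sizes[dim] divides evenly.
void emitChunkArgs(char* data, const std::vector<int64_t>& sizes,
                   const std::vector<int64_t>& strides, int64_t elementSize,
                   int64_t nChunks, int64_t dim,
                   std::vector<TensorInfoLike>& kernelArgs) {
  int64_t splitSize = sizes[dim] / nChunks;
  // Byte offset between the starts of consecutive chunks.
  int64_t step = splitSize * strides[dim] * elementSize;
  for (int64_t i = 0; i < nChunks; ++i) {
    TensorInfoLike ti{};
    ti.data = data + i * step;
    for (size_t d = 0; d < sizes.size(); ++d) {
      // Same geometry as the input, except the chunked dimension shrinks.
      ti.sizes[d] = (static_cast<int64_t>(d) == dim) ? splitSize : sizes[d];
      ti.strides[d] = strides[d];
    }
    kernelArgs.push_back(ti);  // no intermediate at::Tensor is created
  }
}
```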
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request on Sep 11, 2018 (same commit message as above).
cc @zdevito @apaszke
I benchmarked an LSTM with numLayers = 1, seq_length = 100, input_size = hidden_size = 512, with all requires_grad=False. Script here: https://gist.github.com/zou3519/f0840f773f8835ee834b748148042e22.
Perf is very bad with requires_grad=True (both before and after my changes), so I am investigating that (#10032).
Before numbers (ms): thnn 9.9221, cudnn 6.6539, jit 11.2550
After numbers (ms): thnn 8.8468, cudnn 6.5797, jit 9.3470