nn.Transformer #20170
Conversation
Nice progress. But I have several comments, and I also don't see the test and doc entries.
We haven't decided on a unit test for the transformer model yet. I will add the doc entries.
Add transformer docstring.
Looking much better! Just as @ssnl mentioned, please add unit tests.
Remove the numpy dependency in Transformer.py. Add custom_encoder and custom_decoder to the Transformer class.
Add a unit test for Transformer -- test_transformer_args_check.
Add a unit test for Transformer -- TestNN.test_transformerencoderlayer.
Add fixed numerical results.
Update TestNN.test_transformer_args_check to include src_mask, tgt_mask, memory_mask arguments
… test_transformerencoderlayer.
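For context, those mask arguments have fixed expected shapes, which is what the args check exercises. A minimal sketch using the merged API (the sizes here are arbitrary, chosen purely for illustration):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=16, nhead=2)
S, T = 5, 4  # source and target sequence lengths (arbitrary)

# Causal masks: float tensors with 0.0 on allowed positions and -inf on masked ones.
src_mask = model.generate_square_subsequent_mask(S)  # shape (S, S)
tgt_mask = model.generate_square_subsequent_mask(T)  # shape (T, T)
memory_mask = torch.zeros(T, S)                      # shape (T, S); all positions attendable
```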
Pinging @srush @kyunghyuncho @myleott @glample for additional feature requests / review.
Pinging @stephenroller @douwekiela for additional feature requests / review.
We also have a PR applying torch.nn.Transformer to the word language model problem (pytorch/examples#555). @myleott
torch/nn/modules/transformer.py
Outdated
```python
class Transformer(Module):
    r"""A transformer model. User is able to modify the attributes as needed. The architecture
    is based on the paper "Attention Is All You Need".
```
Add full citation
added.
facebook-github-bot left a comment
@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Fixes #10459
```python
def __init__(self, encoder_layer, num_layers, norm=None):
    super(TransformerEncoder, self).__init__()
    self.layers = _get_clones(encoder_layer, num_layers)
```
Is it intended behavior to have weight sharing? In a standard transformer (like the one from Attention Is All You Need), I don't believe the weights are shared. I understand there has been follow-up work showing weight sharing can be beneficial in some cases, but if this is intended behavior, it might be useful to note this difference from the paper's implementation in the docs, after the Vaswani et al. citation, and cite the follow-up work.
As a broader point:
This seems to get at the fact that a Transformer is a rather high-level component, without real consensus yet on its architecture, and with many decisions likely domain-specific (weight sharing? beam search decoder? convolutions between layers? outputting the intermediate attentions? hierarchy / grouping components of the task? sparsity? ... and many more ideas that are being rapidly published).
I don't mean to start needless debates, or to imply at all that this Transformer code is not useful, but I will add that, from my limited perspective, Transformers might better belong in the contrib module, or in the docs as an example that people can modify to meet their needs. The torch.nn module seems (at least currently) to be for more foundational components that can be composed into larger models, not full model architectures themselves.
There has already been fairly extensive discussion on this in #10459, and it seemed like the consensus there was to focus on things like MultiHeadedAttention or PositionalEncoding for core, and to keep full architectures separate in the codebase.
As an addendum to that second part: thinking about this more, I could see how a TransformerEncoder (or something like a StackedSelfAttention or RecurrentSelfAttention) could be considered a primitive component that lends itself to eventually being low-level optimized and composed into novel things (though not really much more primitive than something like a ResNet block). However, where it starts seeming particularly high-level and full-architecture-y is the inclusion of a prepackaged encoder-decoder Transformer for seq2seq, which, without a lot of additional components, seems unlikely to meet all needs or to be easily adapted/composed.
I don't know, and I don't really have any reason to hold strong opinions here. Feel free to dismiss this second part without real justification...
@DNGros Thanks for the comments. To your first question: no, it's not supposed to share the weights between layers. Actually, if you take a quick look at the _get_clones function, it calls copy.deepcopy.
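For reference, _get_clones in torch/nn/modules/transformer.py is essentially the following (paraphrased; the deepcopy is what prevents weight sharing):

```python
import copy
from torch.nn import ModuleList

def _get_clones(module, N):
    # copy.deepcopy gives every clone its own parameters,
    # so the N stacked layers do not share weights.
    return ModuleList([copy.deepcopy(module) for _ in range(N)])
```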
For your second comment, we do realize there are many ongoing discussions about transformer models and their variants in different domains. We want to provide a baseline for the research community and for startups, so people don't have to code from scratch. This module could help people try preliminary ideas quickly. We expect advanced users to develop their own specific transformer models, and they could use this module as a reference or benchmark.
As you suggest, we try to make the model highly "modularized". People can use nn.Transformer, nn.TransformerEncoder, or even nn.TransformerEncoderLayer, as needed. There is actually a word language model example (pytorch/examples#555) where we use nn.TransformerEncoder as the seq2seq model.
In the future, if we see a variant of the transformer model widely requested by the community, we will implement it in our framework as well. A baseline model should benefit more users as we optimize the module's performance over time.
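As an illustration of that modularity, here is a minimal sketch composing an encoder-only stack from the individual pieces (dimensions are arbitrary, matching the word language model example only in spirit):

```python
import torch
import torch.nn as nn

# Compose an encoder-only stack from the individual modules.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.randn(10, 32, 512)  # (seq_len, batch, d_model)
memory = encoder(src)           # output keeps the input shape
```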
@zhangguanheng66 Ok, sorry. I missed that.
Thanks for clarifying!
cpuhrsch left a comment
Merge at will
facebook-github-bot left a comment
@zhangguanheng66 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@zhangguanheng66 merged this pull request in 83cec5f.
Creating a PR for comments. The model is still WIP, but I want to get some feedback before moving too far. The transformer model depends on several modules, like MultiheadAttention (landed).
Transformer is implemented based on the paper (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer with self-defined and/or built-in components (i.e., encoder, decoder, encoder_layer, decoder_layer). Users could use the Transformer class to build a standard transformer model and modify sub-layers as needed.
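A minimal sketch of plugging in a self-defined component via custom_encoder (the identity encoder below is purely illustrative; any module with a compatible forward signature should work):

```python
import torch
import torch.nn as nn

class MyEncoder(nn.Module):
    # Illustrative stand-in: a real encoder would transform src.
    # It must accept the keyword arguments Transformer.forward passes through.
    def forward(self, src, mask=None, src_key_padding_mask=None):
        return src

model = nn.Transformer(d_model=512, nhead=8, custom_encoder=MyEncoder())
out = model(torch.randn(10, 2, 512),   # src: (S, N, E)
            torch.randn(7, 2, 512))    # tgt: (T, N, E)
print(out.shape)  # torch.Size([7, 2, 512])
```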
Add a few unit tests for the transformer module, as follows (a shape-check sketch appears after the list):
TestNN.test_Transformer_cell
TestNN.test_transformerencoderlayer
TestNN.test_transformerdecoderlayer
TestNN.test_transformer_args_check
TestScript.test_scriptmodule_transformer_cuda
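A minimal shape check in the spirit of those tests (a hypothetical sketch, not the actual test code; sizes are arbitrary):

```python
import torch
import torch.nn as nn

def test_transformer_forward_shape():
    d_model, nhead = 32, 4
    model = nn.Transformer(d_model=d_model, nhead=nhead,
                           num_encoder_layers=2, num_decoder_layers=2)
    S, T, N = 10, 7, 3  # src length, tgt length, batch size
    src = torch.randn(S, N, d_model)  # inputs are (seq_len, batch, d_model)
    tgt = torch.randn(T, N, d_model)
    out = model(src, tgt)
    assert out.shape == (T, N, d_model)  # output follows the target shape
```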
There is also a demonstration example applying the transformer module to the word language model problem: pytorch/examples#555.