
Conversation

@zhangguanheng66 (Contributor) commented May 6, 2019

Creating a PR for comments. The model is still WIP, but I want to get some feedback before moving too far. The transformer model depends on several modules, such as MultiheadAttention (already landed).

Transformer is implemented based on the paper "Attention Is All You Need" (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer from self-defined and/or built-in components (i.e. encoder, decoder, encoder_layer, decoder_layer). Users can use the Transformer class to build a standard transformer model and modify the sub-layers as needed.
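
For context, a minimal sketch of the intended usage (an illustration based on the API described in this PR, not code taken from it; exact defaults may differ):

    import torch
    import torch.nn as nn

    # Build a standard transformer model in one line...
    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6)

    # ...or compose one from the built-in components.
    encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

    src = torch.rand(10, 32, 512)  # (source length, batch, d_model)
    tgt = torch.rand(20, 32, 512)  # (target length, batch, d_model)
    out = model(src, tgt)          # (target length, batch, d_model)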

Add a few unit tests for the transformer module, as follows:
TestNN.test_Transformer_cell
TestNN.test_transformerencoderlayer
TestNN.test_transformerdecoderlayer
TestNN.test_transformer_args_check
TestScript.test_scriptmodule_transformer_cuda

There is also a demonstration example applying the transformer module to the word language modeling problem: pytorch/examples#555

@pytorchbot added the module: nn (Related to torch.nn) label May 6, 2019
@ssnl (Collaborator) commented May 6, 2019

Nice progress. But I have several comments and I also don't see the test and doc entries.

@zhangguanheng66 (Contributor, Author) commented

> Nice progress. But I have several comments and I also don't see the test and doc entries.

We haven't decided on a unit test for the transformer model yet. I will add the doc entries.

Guanheng Zhang added 2 commits May 6, 2019 12:09
@cpuhrsch (Contributor) commented May 7, 2019

Looking much better! Just as @ssnl mentioned, please add unit tests.

Guanheng Zhang added 2 commits May 8, 2019 10:08
Remove the dependency on the numpy library in Transformer.py.
Add custom_encoder and custom_decoder to the Transformer class.
Add a unit test for Transformer -- test_transformer_args_check
Add a unit test for Transformer -- TestNN.test_transformerencoderlayer
Add a unit test for Transformer -- TestNN.test_transformerdecoderlayer
Add fixed numerical results.
Update TestNN.test_transformer_args_check to include the src_mask, tgt_mask, and memory_mask arguments (mask usage is sketched below).
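
A hedged sketch of how the mask arguments are exercised (shape conventions: src_mask is (S, S), tgt_mask is (T, T), memory_mask is (T, S); the generate_square_subsequent_mask helper follows the API in this PR):

    import torch
    import torch.nn as nn

    model = nn.Transformer(d_model=512, nhead=8)
    src = torch.rand(10, 32, 512)  # (S, N, E)
    tgt = torch.rand(20, 32, 512)  # (T, N, E)

    # Causal mask: each target position may attend only to earlier positions.
    tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))

    out = model(src, tgt, tgt_mask=tgt_mask)  # src_mask and memory_mask stay optional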
@pytorchbot added the module: docs (Related to our documentation, both in docs/ and docblocks) label May 10, 2019
@cpuhrsch (Contributor) commented May 13, 2019

Pinging @srush @kyunghyuncho @myleott @glample for additional feature requests / review.

@cpuhrsch changed the title from "[WIP] A new PR for transformer model." to "nn.Transformer" May 13, 2019
@cpuhrsch (Contributor) commented
Pinging @mansimov @jasonleeinf @myleott @fmassa for additional feature requests / review.

@cpuhrsch (Contributor) commented
Pinging @stephenroller @douwekiela for additional feature requests / review.

@zhangguanheng66 (Contributor, Author) commented
We also have a PR applying torch.nn.Transformer to the word language modeling problem (pytorch/examples#555). @myleott


class Transformer(Module):
    r"""A transformer model. User is able to modify the attributes as needed. The architecture
    is based on the paper "Attention Is All You Need".
Contributor

Add full citation

Contributor Author
added.

@pytorchbot added the oncall: jit (Add this issue/PR to JIT oncall triage queue) label May 28, 2019
@facebook-github-bot (Contributor) left a comment

@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zhangguanheng66 (Contributor, Author) commented
fix #10459


def __init__(self, encoder_layer, num_layers, norm=None):
    super(TransformerEncoder, self).__init__()
    self.layers = _get_clones(encoder_layer, num_layers)
@DNGros commented

Is it intended behavior to have weight sharing? In a standard transformer (as in Attention Is All You Need), I don't believe the weights are shared. I understand there has been follow-up work showing weight sharing can be beneficial in some cases, but if this is intended behavior, it might be useful to note this difference from the paper's implementation in the docs after the Vaswani et al. citation and cite the follow-up work.


As a broader point:
This gets at the fact that a Transformer is a rather high-level component, without real consensus yet on its architecture, and with a lot of decisions likely domain-specific (weight sharing? beam search decoder? convolutions between layers? outputting the intermediate attentions? hierarchy/grouping of components of the task? sparsity? ... and many more ideas that are being rapidly published).

I don't mean to start needless debates, or to imply this Transformer code is not useful, but I will add that, from my limited perspective, Transformers might better belong in the contrib module, or in the docs as an example that people can modify to meet their needs. The torch.nn module seems (at least currently) to be for more foundational components that can be composed into larger models, not full model architectures themselves.

There has already been fairly extensive discussion of this in #10459, and it seemed like the consensus there was to focus on things like MultiHeadedAttention or PositionalEncoding for core and keep full architectures separate in the codebase.


As an addendum to that second part: thinking about this more, I could see how a TransformerEncoder (or something like a StackedSelfAttention or RecurrentSelfAttention) could be considered a primitive component that might eventually be low-level optimized and composed into novel things (though not really much more primitive than something like a ResNet block). Where it starts seeming particularly high-level and full-architecture-y, however, is the inclusion of a prepackaged encoder-decoder Transformer for seq2seq, which, without a lot of additional components, seems unlikely to meet all needs and cannot be easily adapted/composed.
I don't know, and I don't really have any reason to hold strong opinions here. Feel free to dismiss this second part without real justification...

@zhangguanheng66 (Contributor, Author) commented

@DNGros Thanks for the comments. To your first question: no, the weights are not supposed to be shared between layers. If you take a quick look at the _get_clones function, it calls deepcopy.
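
For reference, the essential behavior of _get_clones is roughly the following (a sketch, not a verbatim copy of the PR's helper):

    import copy
    from torch.nn import ModuleList

    def _get_clones(module, N):
        # copy.deepcopy gives each stacked layer its own parameters,
        # so no weights are shared between the N layers.
        return ModuleList([copy.deepcopy(module) for i in range(N)])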

For your second comment, we do realize there are many ongoing discussions about transformer models and their variants in different domains. We want to provide a baseline for the research community and for startups, for people who don't want to code from scratch. This module can help people try preliminary ideas quickly. We expect advanced users to develop their own specific transformer models, possibly using this module as a reference or benchmark.

As you suggest, we have tried to make the model highly "modularized". People can use nn.Transformer, nn.TransformerEncoder, or even nn.TransformerEncoderLayer, as needed. There is actually a word language modeling example (pytorch/examples#555) where we use nn.TransformerEncoder as the seq2seq model (sketched below).
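
A hedged sketch of that encoder-only usage (the embedding layer and output projection here are illustrative placeholders, not part of this PR):

    import torch
    import torch.nn as nn

    vocab_size, d_model = 10000, 512

    embed = nn.Embedding(vocab_size, d_model)  # placeholder input embedding
    encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
    head = nn.Linear(d_model, vocab_size)      # placeholder projection to logits

    tokens = torch.randint(vocab_size, (35, 20))  # (sequence length, batch)
    hidden = encoder(embed(tokens))               # (sequence length, batch, d_model)
    logits = head(hidden)                         # per-position vocabulary logits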

In the future, if we see a variant of the transformer model widely requested by the community, we will implement it in the framework as well. A baseline model can benefit more users as we optimize the module's performance over time.

@DNGros replied
@zhangguanheng66 Ok, sorry. I missed that.

Thanks for clarifying!

@cpuhrsch (Contributor) left a comment
Merge at will

@facebook-github-bot (Contributor) left a comment
@zhangguanheng66 is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@facebook-github-bot (Contributor) commented
@zhangguanheng66 merged this pull request in 83cec5f.


Labels

Merged
module: docs (Related to our documentation, both in docs/ and docblocks)
module: nn (Related to torch.nn)
oncall: jit (Add this issue/PR to JIT oncall triage queue)


9 participants