C++ APIs TransformerEncoder #43187
Conversation
💊 CI failures summary (Dr. CI): as of commit e70c50b, there are no failures yet.
Differential Revision: [D23182770](https://our.internmc.facebook.com/intern/diff/D23182770)
```cpp
  transformer_decoder_layer_test_helper(true);
}

void transformer_decoder_layer_test_helper_gelu(bool is_cuda) {
```
Just wondering why we need this helper specifically for gelu? Any idea for relu?
This mirrors what we have in test_nn.py.
I merged them into one test function for the encoder layer, but the decoder layer was contributed by an external user who strictly followed test_nn.py. I will refactor this a bit later.
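For reference, a minimal sketch of what such a merged helper could look like, parameterizing the activation instead of duplicating the test body. This is illustrative only, not the PR's code; the option and module names assume the C++ transformer API in this stack.

```cpp
#include <torch/torch.h>

// Hypothetical merged helper: the only difference between the relu and gelu
// variants is the activation set on the layer options.
void transformer_decoder_layer_smoke_test(bool is_cuda, bool use_gelu) {
  torch::Device device(is_cuda ? torch::kCUDA : torch::kCPU);

  auto options =
      torch::nn::TransformerDecoderLayerOptions(/*d_model=*/4, /*nhead=*/2)
          .dropout(0.0);
  if (use_gelu) {
    options.activation(torch::kGELU);
  }

  torch::nn::TransformerDecoderLayer layer(options);
  layer->to(device);

  auto tgt = torch::rand({2, 3, 4}, device);     // (T, N, E)
  auto memory = torch::rand({5, 3, 4}, device);  // (S, N, E)
  auto out = layer->forward(tgt, memory);
  TORCH_CHECK(out.sizes() == tgt.sizes(), "unexpected output shape");
}
```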
```cpp
    : d_model_(d_model), nhead_(nhead) {}

TransformerEncoderOptions::TransformerEncoderOptions(
```
Should we publish the transformer C++ API later? The transformer model in general comes with a decoder module, which I guess will be published later.
@zhangguanheng66 The current situation is that we have two people working on the transformer. The decoder will be in another PR. We are pushing this first so it is easy for the decoder developer to sync, and we will have the Transformer top-level module next week.
zou3519 left a comment:
(Not a full review) Asking some questions.
```cpp
  Tensor forward(
      const Tensor& src,
      const Tensor& src_mask = {},
```
The second argument is called `mask` in Python (pytorch/torch/nn/modules/transformer.py, line 167 at cbdaa20):

```python
def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
```

I do prefer src_mask because it makes it clear what is getting masked...
@zhangguanheng66 I changed the name to src_mask for the same reason as @zou3519. Agree?
I think it makes sense to call them src_mask and tgt_mask, along with src_key_padding_mask and tgt_key_padding_mask (consistent with the Python ones).
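For illustration, a minimal usage sketch of the C++ encoder layer with both mask arguments. Shapes and defaults here mirror the Python semantics; treat it as a sketch, not the PR's test code.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::nn::TransformerEncoderLayer layer(
      torch::nn::TransformerEncoderLayerOptions(/*d_model=*/8, /*nhead=*/2));

  auto src = torch::rand({4, 2, 8});  // (S, N, E)

  // Additive attention mask of shape (S, S); all zeros means "attend
  // everywhere", standing in for Python's None default.
  auto src_mask = torch::zeros({4, 4});

  // Key-padding mask of shape (N, S); false keeps the position.
  auto src_key_padding_mask = torch::zeros({2, 4}, torch::kBool);

  // Both mask arguments default to empty tensors, matching Python's None.
  auto out = layer->forward(src, src_mask, src_key_padding_mask);
  std::cout << out.sizes() << std::endl;
  return 0;
}
```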
```cpp
// a. No way to know whether module in AnyModule has api to reset_parameters, so replace instead
// b. Allow user to add/delete normalization module when reset parameters
```
When does reset_parameters() get called? Is it a user API or something that is used in the framework?
@zou3519 As I mentioned before, this is a convenience function for resetting parameters within the module. It is not necessary to call it, and I believe it won't be called in most cases.
Hmm, so the thing I am confused about here is:
- There's no reset_parameters() or _reset_parameters on TransformerEncoderLayer in Python. Some of our Python nn.Modules do have a reset_parameters() that re-initializes parameters.

And one potential concern: let's say you have an optimizer that takes in a list of parameters (disclaimer: I have no clue how optimizers in C++ work). Is the following possible?
- I create a TransformerEncoderLayer in C++.
- I pass the parameters of the TransformerEncoderLayer to an optimizer (in C++).
- Then, I call reset_parameters() on the TransformerEncoderLayer.
- The optimizer holds onto references to the original parameters (and not the new ones!).
This is not possible, since the underlying C++ module is a shared pointer, and the optimizer doesn't clone the module but just shallow-copies it, so this case should be OK.
But reset_parameters() itself is tricky, I admit that. Even in the Python implementation, reset_parameters() has different usage in different modules, is not always there, and does not always get called. There is a PR trying to standardize it by making reset_parameters() virtual and moving it to the base class, but that change doesn't make sense, since half of the modules don't have it and the virtual call adds extra cost. So currently, reset_parameters() is just an optional function.
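To make the scenario concrete, here is a sketch of the sequence described above, assuming reset_parameters() is exposed on the C++ encoder module as discussed in this PR; it is not a definitive statement of the landed API.

```cpp
#include <torch/torch.h>

int main() {
  // Build a TransformerEncoder from layer options.
  torch::nn::TransformerEncoder encoder(torch::nn::TransformerEncoderOptions(
      torch::nn::TransformerEncoderLayerOptions(/*d_model=*/8, /*nhead=*/2),
      /*num_layers=*/2));

  // The optimizer is constructed with the module's parameter tensors.
  torch::optim::SGD optimizer(encoder->parameters(), /*lr=*/0.01);

  // Re-initialize parameters after the optimizer was created. Per the
  // discussion above, the module is held by shared pointer and the optimizer
  // does not clone it, so this sequence is expected to be fine.
  encoder->reset_parameters();

  // A training-style step still runs against the module's parameters.
  optimizer.zero_grad();
  auto out = encoder->forward(torch::rand({4, 2, 8}));  // (S, N, E)
  out.sum().backward();
  optimizer.step();
  return 0;
}
```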
```cpp
// This constructor will create a new TransformerEncoderLayer obj based on passed in encoder_layer_options.
TransformerEncoderOptions(const TransformerEncoderLayerOptions& encoder_layer_options, int64_t num_layers);
```
This also keeps a shallow copy of the data in encoder_layer, right?
@zou3519
No, TransformerEncoder will use encoder_layer_options as input, allocating a new TransformerEncoderLayer.
This is the major difference between these two constructors.
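To illustrate the difference, a sketch of the two construction paths. The option and constructor names follow this stack; treat it as a sketch rather than the final API.

```cpp
#include <torch/torch.h>

int main() {
  auto layer_opts =
      torch::nn::TransformerEncoderLayerOptions(/*d_model=*/8, /*nhead=*/2);

  // Path 1: from layer options. The encoder allocates a brand-new
  // TransformerEncoderLayer internally from these options.
  torch::nn::TransformerEncoder enc_from_options(
      torch::nn::TransformerEncoderOptions(layer_opts, /*num_layers=*/3));

  // Path 2: from an already-constructed layer module.
  torch::nn::TransformerEncoderLayer layer(layer_opts);
  torch::nn::TransformerEncoder enc_from_layer(
      torch::nn::TransformerEncoderOptions(layer, /*num_layers=*/3));

  auto src = torch::rand({4, 2, 8});  // (S, N, E)
  auto out = enc_from_options->forward(src);
  return 0;
}
```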
Oh I see now, thanks for the clarification
Codecov Report

```
@@             Coverage Diff              @@
##   gh/glaringlee/25/base   #43187   +/- ##
=============================================
  Coverage       69.40%    69.40%
=============================================
  Files             378       378
  Lines           46606     46606
=============================================
  Hits            32346     32346
  Misses          14260     14260
```

Continue to review the full report at Codecov.
@glaringlee merged this pull request in 48e08f8.
Stack from ghstack:
Differential Revision: D23182770