Add key_padding_mask kwarg to Transformer #22588
Conversation
fix #22374 (comment)

@sebamenabar I think this PR is relevant to your feature request. Feel free to add review comments. Thanks.

@stephenroller @myleott @ngimel @DNGros @mttk a PR to add key_padding to transformer_encoder. Just wondering if we should do the same thing for transformer_decoder.

@lucasgadams please update the text with motivation for the PR.
Force-pushed from c3c6380 to 74d55ca
Force-pushed from 74d55ca to 1f09027
Force-pushed from 1f09027 to 5bb9001
zhangguanheng66 left a comment
After some discussions, I feel it makes some sense to have the key_padding in the decoder (at least for the self-attention layer). We prefer to maintain a generic API because we can't predict the applications on the user side.
Agreed, I think it also makes sense to add the key_padding_mask on the decoder's multihead attention that looks at the encoder output, since we usually have padding on both the source and the target.

OK, I have added the key_padding_mask kwarg to the Decoder and DecoderLayer, which will pass it in to the target-memory attention. One more question is whether we want to automatically pass the src key_padding_mask to both the encoder and decoder in Transformer's forward method. This means people will be forced to use it on both if they use it on either in Transformer. I think this is probably fine, and so that is what I have done in the latest diff. But let me know if you think we don't want to do that automatically in Transformer. I will add some test cases for the decoder piece.
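For illustration only (this code is not from the PR): a minimal layer-level sketch of where the two decoder-side padding masks end up. The kwarg names follow the API as eventually released (tgt_key_padding_mask for the decoder self-attention keys, memory_key_padding_mask for the encoder-decoder attention keys); the sizes are toy values, and inputs use the sequence-first (S, N, E) layout.

```python
import torch
import torch.nn as nn

S, T, N, E = 6, 4, 2, 16                     # source len, target len, batch, embed dim
decoder_layer = nn.TransformerDecoderLayer(d_model=E, nhead=4)

tgt = torch.rand(T, N, E)                    # decoder input, (T, N, E)
memory = torch.rand(S, N, E)                 # encoder output, (S, N, E)

# True marks padded positions that attention should ignore.
tgt_key_padding_mask = torch.zeros(N, T, dtype=torch.bool)
tgt_key_padding_mask[1, -1] = True           # last target token of sample 1 is padding
memory_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
memory_key_padding_mask[1, -2:] = True       # last two source tokens of sample 1 are padding

out = decoder_layer(
    tgt, memory,
    tgt_key_padding_mask=tgt_key_padding_mask,        # masks keys in decoder self-attention
    memory_key_padding_mask=memory_key_padding_mask,  # masks keys in encoder-decoder attention
)
print(out.shape)                             # torch.Size([4, 2, 16])
```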
Force-pushed from 5bb9001 to a4b629e
See my comments. I don't think we want to pass a single key_padding to both encoder and decoder. Thanks.

I added the requested changes. Transformer.forward now uses src_key_padding_mask, tgt_key_padding_mask, and memory_key_padding_mask. Decoder uses key_padding_mask and memory_key_padding_mask. Modified the relevant tests and added some new deterministic ones for the decoder masks.
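To make the final argument set concrete, here is a minimal end-to-end sketch (not taken from the PR) of nn.Transformer with the three padding-mask kwargs named above. The make_padding_mask helper is illustrative, the sizes are toy values, and inputs use the sequence-first (S, N, E) layout.

```python
import torch
import torch.nn as nn

def make_padding_mask(lengths, max_len):
    # Illustrative helper: (N, max_len) boolean mask, True at padded positions.
    positions = torch.arange(max_len).unsqueeze(0)   # (1, max_len)
    return positions >= lengths.unsqueeze(1)         # (N, max_len)

S, T, N, E = 10, 7, 3, 32                            # source len, target len, batch, embed dim
model = nn.Transformer(d_model=E, nhead=4)

src = torch.rand(S, N, E)                            # (S, N, E)
tgt = torch.rand(T, N, E)                            # (T, N, E)

src_lengths = torch.tensor([10, 8, 5])               # true lengths before padding
tgt_lengths = torch.tensor([7, 6, 4])

src_key_padding_mask = make_padding_mask(src_lengths, S)   # (N, S), masks encoder self-attention keys
tgt_key_padding_mask = make_padding_mask(tgt_lengths, T)   # (N, T), masks decoder self-attention keys
# The memory is the encoder output, which has the same length as the source,
# so its padding mask is typically just the source padding mask again.
memory_key_padding_mask = src_key_padding_mask              # (N, S)

out = model(
    src, tgt,
    src_key_padding_mask=src_key_padding_mask,
    tgt_key_padding_mask=tgt_key_padding_mask,
    memory_key_padding_mask=memory_key_padding_mask,
)
print(out.shape)                                     # torch.Size([7, 3, 32])
```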
Force-pushed from a4b629e to ad0360a
I think everything is covered now. The argument list looks very long, but in my opinion all of the options are standard use cases of the transformer.
zhangguanheng66 left a comment
.
Force-pushed from ad0360a to 715aea1
facebook-github-bot left a comment
@lucasgadams has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed from 715aea1 to 238ee45
@pytorchbot retest this please

Sorry, only maintainers are authorized to rebase other people's PRs. Feel free to try again on one of your PRs! (To learn more about this bot, see Bot commands.)

@pytorchbot rebase this please
Force-pushed from 8a39fbe to 958dfa3
facebook-github-bot left a comment
@lucasgadams is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Motivation:
The forward method of MultiheadAttention has a kwarg, key_padding_mask. This mask is of shape (N, S), where N is the batch size and S is the sequence length. It is applied prior to the attention softmax: positions where the mask is True are set to float('-inf'). This allows you to mask position j from attention for every position i in the input sequence. It's typically used to mask padded inputs, so for each sample in a batch we can make sure no encoder outputs depend on padding inputs. Currently the Transformer, TransformerEncoder, and TransformerEncoderLayer do not have this kwarg; they only have options for (S, S), (T, T), and (T, S) masks, which are applied equally across the batch to the source input, target output, and target-source memory respectively. These masks can't be used for padding and are instead used for things like subsequent masking in language modeling, by masking the attention of position i to position j.
This diff exposes key_padding_mask on the Transformer, TransformerEncoder, and TransformerEncoderLayer forward methods; it is ultimately passed to MultiheadAttention's forward.
Open question: should we also allow a key_padding_mask for the decoder layer? Since padding is usually at the end of each sentence in a batch, and sentences are usually decoded from left to right, people typically deal with padding on decoded outputs by masking those outputs at the loss layer. There might be some scenarios where it's needed, though I don't think it would be common, and people can still subclass and override the layers. We could also pass the input key_padding_mask to the memory <> decoder attention layer. I'm not sure that's necessary, though, because the output of position i from each encoder attention layer won't depend on any masked positions in the input (even if position i is itself a masked position), so there's not really any point in masking position i again.
Adds the key_padding_mask kwarg to the Transformer, TransformerEncoder, and TransformerEncoderLayer forward methods.
The standard TransformerEncoderLayer uses a MultiheadAttention layer as self_attn. MultiheadAttention's forward method has a key_padding_mask kwarg that allows masking of values, such as padding, per sequence in a batch, in contrast to the attn_mask kwarg, which is usually of shape (S, S) and applied equally across the batch.
MultiheadAttention calls functional.multi_head_attention_forward, which has the same key_padding_mask kwarg of shape (N, S); masked (True) values are set to float('-inf'). (A short usage sketch follows below.)
Pull Request resolved: pytorch#22588
Test Plan:
buck test mode/dev caffe2/test:nn -- 'test_transformerencoderlayer \(test_nn\.TestNN\)'
buck test mode/dev caffe2/test:nn -- 'test_Transformer_cell \(test_nn\.TestNN\)'
buck test mode/dev caffe2/test:nn -- 'test_transformer_args_check \(test_nn\.TestNN\)'
Differential Revision: D16112263
fbshipit-source-id: c56e9ff409f6666253cfc9b1d23656981e6729d1
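As a concrete illustration of the behavior described in the summary above (not part of the PR itself), the sketch below builds a boolean (N, S) key_padding_mask for nn.MultiheadAttention and checks that attention weights on masked key positions come out as zero. The sizes are toy values, and a recent PyTorch build is assumed so that a bool mask is accepted.

```python
import torch
import torch.nn as nn

N, S, E, H = 2, 5, 16, 4                     # batch, sequence length, embed dim, heads
mha = nn.MultiheadAttention(embed_dim=E, num_heads=H)

query = torch.rand(S, N, E)                  # (S, N, E), sequence-first layout
key = value = torch.rand(S, N, E)

# (N, S) boolean mask: the second sample's last two key positions are padding.
key_padding_mask = torch.tensor([
    [False, False, False, False, False],
    [False, False, False, True,  True ],
])

out, attn_weights = mha(query, key, value, key_padding_mask=key_padding_mask)

# Because the masked scores are set to float('-inf') before the softmax,
# the attention weights on padded key positions are exactly zero, so the
# outputs for sample 1 never depend on its padded inputs.
print(attn_weights[1, :, 3:].abs().max().item())    # 0.0
```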
Force-pushed from 958dfa3 to 2cd03ec
@lucasgadams merged this pull request in c6fe864.