
Conversation

@zhangguanheng66
Contributor

@zhangguanheng66 zhangguanheng66 commented Jun 3, 2019

The changes include:

  1. Allow key/value to have a different number of features from the query. This supports the case where key and value have different feature dimensions (see the usage sketch below).
  2. Support three separate proj_weights in addition to a single in_proj_weight. The proj_weights for key and value may have different dimensions from that of the query, so three separate proj_weights are necessary. When key and value have the same dimension as the query, a single large proj_weight is preferred for performance reasons. However, whether a single large weight or three separate weights is faster is a size-dependent decision.
  3. Give an option to use static k and v in the multihead_attn operator (see saved_k and saved_v). Those static key/value tensors can now be reused when training the model.
  4. Add more test cases to cover the new arguments.

Note: current users should not be affected by the changes.
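
As a usage sketch (not part of the PR text; the kdim/vdim constructor arguments follow the module attributes visible in the diff below, e.g. self.kdim), key and value can now carry their own feature sizes:

import torch
import torch.nn as nn

# Hypothetical usage sketch: query has 8 features, key has 16, value has 24.
embed_dim, num_heads = 8, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, kdim=16, vdim=24)

L, S, N = 5, 7, 3                    # target length, source length, batch size
query = torch.rand(L, N, embed_dim)  # (L, N, E)
key = torch.rand(S, N, 16)           # (S, N, kdim)
value = torch.rand(S, N, 24)         # (S, N, vdim)

attn_output, attn_weights = mha(query, key, value)
print(attn_output.shape)             # torch.Size([5, 3, 8])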

@zhangguanheng66 zhangguanheng66 requested a review from cpuhrsch June 3, 2019 16:20
@pytorchbot pytorchbot added the oncall: jit and module: nn labels Jun 3, 2019

self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
Contributor

When you load older models / older models' state dicts, the model will break / compute wrong results. What are you going to do about that?

you can do what batchnorm did. see https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py#L16 and look at _version = 2, and https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py#L90-L103

Contributor Author

Thanks for pointing this out.

Added a _load_from_state_dict() function to detect in_proj_weight. If in_proj_weight exists in the state_dict, it is split into three separate weights. I assume users know how to use state_dict() and load_state_dict() to resolve this type of version conflict.

Tested on the word language model:

  1. Load the model saved with the old module, which has in_proj_weight.
    model = torch.load("old.pt")
  2. Generate the state_dict.
    state_dict = model.state_dict()
  3. Create a model based on the new module, which has three separate weights.
    new_model = new_module()
  4. Map the state onto the new model.
    new_model.load_state_dict(state_dict)
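
A minimal sketch of the conversion described above (hypothetical helper name; the actual in-tree hook may be structured differently): if an old checkpoint still carries the combined in_proj_weight, split it into the three separate projection weights before loading.

def split_in_proj_weight(state_dict, prefix=''):
    # Hypothetical migration helper: split the old combined (3*E, E) weight
    # into the q/k/v projection weights expected by the new module layout.
    key = prefix + 'in_proj_weight'
    if key in state_dict:
        q_w, k_w, v_w = state_dict.pop(key).chunk(3, dim=0)
        state_dict[prefix + 'q_proj_weight'] = q_w
        state_dict[prefix + 'k_proj_weight'] = k_w
        state_dict[prefix + 'v_proj_weight'] = v_w
    return state_dict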

@ezyang ezyang added facebook and removed facebook labels Jun 5, 2019
be ignored by the attention.
need_weights: output attn_output_weights.
attn_mask: mask that prevents attention to certain positions.
use_chunk_proj_weight: use in_proj_weight insteady of q_proj_weight, k_proj_weight, v_proj_weight.
Contributor

Nit: "instead"

Also, this means that passing the flags for q, k, v weights is now the default? That's breaking backwards compatibility. Is that worth it?

Collaborator

It preserves backwards compatibility (use_chunk_proj_weight=True will use in_proj_weight, like before), but the name is confusing.

Contributor Author

I have added more docs for use_chunk_proj_weight.

the embedding dimension.
- key_padding_mask: :math:`(N, S)`, ByteTensor, where N is the batch size, S is the source sequence length.
- attn_mask: :math:`(L, S)` where L is the target sequence length, S is the source sequence length.
- saved_k: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length,
Contributor

Why is it called "saved"?

Contributor Author

We can use static_k and static_v instead.
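
For context, a hedged usage sketch of passing such pre-projected key/value tensors through the functional op (keyword names as they appear in the final torch.nn.functional API; shapes follow the docstring above):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch only: reuse pre-projected (static) key/value tensors of shape
# (N*num_heads, S, E/num_heads).
embed_dim, num_heads, N, L, S = 8, 2, 3, 5, 7
mha = nn.MultiheadAttention(embed_dim, num_heads)
query = torch.rand(L, N, embed_dim)
key = value = torch.rand(S, N, embed_dim)

static_k = torch.rand(N * num_heads, S, embed_dim // num_heads)
static_v = torch.rand(N * num_heads, S, embed_dim // num_heads)
attn_output, attn_weights = F.multi_head_attention_forward(
    query, key, value, embed_dim, num_heads,
    mha.in_proj_weight, mha.in_proj_bias,
    mha.bias_k, mha.bias_v, mha.add_zero_attn,
    mha.dropout, mha.out_proj.weight, mha.out_proj.bias,
    static_k=static_k, static_v=static_v)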

@zhangguanheng66
Contributor Author

fix #21518

Contributor

@facebook-github-bot facebook-github-bot left a comment

@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

q_proj_weight=None, # type: Optional[Tensor]
k_proj_weight=None, # type: Optional[Tensor]
v_proj_weight=None, # type: Optional[Tensor]
static_k=None, # type: Optional[Tensor]
Contributor

I think we had talked about maybe moving this into a separate function by splitting multi_head_attention_forward out into two, one part that does the projections and the other that consumes them. Users that want to do their own projections can then call the latter.
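
To make the suggestion concrete, here is a rough sketch of such a split (hypothetical helper names, not the torch.nn API): one function performs the input projections, and a second consumes already-projected tensors, which is what static_k/static_v would feed into.

import torch
import torch.nn.functional as F

def project_qkv(query, key, value, q_proj_weight, k_proj_weight, v_proj_weight, num_heads):
    # Hypothetical first half: apply the input projections and reshape each
    # result to (batch * num_heads, seq_len, head_dim).
    tgt_len, bsz, embed_dim = query.shape
    head_dim = embed_dim // num_heads
    q = F.linear(query, q_proj_weight).view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
    k = F.linear(key, k_proj_weight).view(-1, bsz * num_heads, head_dim).transpose(0, 1)
    v = F.linear(value, v_proj_weight).view(-1, bsz * num_heads, head_dim).transpose(0, 1)
    return q, k, v

def attend(q, k, v):
    # Hypothetical second half: consume projected q/k/v, so callers that supply
    # their own (static) projections can skip project_qkv entirely.
    scores = torch.bmm(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
    return torch.bmm(torch.softmax(scores, dim=-1), v)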

Contributor Author

But it may not really help with the proj_weights layout issue. We still have to maintain the two layouts at the same time, right?

Collaborator

If you (or the user) pass separate q_proj_weight, k_proj_weight, v_proj_weight, the layout is pretty flexible, and two layouts are not really necessary. But this implies that (in the worst case) 3 gemms will be called instead of 1, which may or may not hurt performance.

Contributor Author

@ngimel Yes. @cpuhrsch has this performance concern, so we keep the single proj_weight option here and only use the three-gemm option when key/value have a different embedding dimension from the query.

Collaborator

FWIW, we used 3 separate gemms in the MLPerf submission, because then the .contiguous() calls here

pytorch/torch/nn/functional.py

Lines 3276 to 3280 in 6d1f0da

q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
if k is not None:
    k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)
if v is not None:
    v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)
become no-ops, which compensates for the gemms potentially being less efficient. Overall, which approach is preferable is size-dependent.
Alternatively, the pointers and strides of the projection weights can be checked, and if they point to a contiguous matrix, a single gemm can be called.
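
A minimal sketch of that pointer/stride check (hypothetical helper, not code from the PR): the three weights can share one gemm only if they are contiguous, same-shaped, and laid out back-to-back in one buffer.

import torch

def proj_weights_form_one_matrix(q_w: torch.Tensor, k_w: torch.Tensor, v_w: torch.Tensor) -> bool:
    # Hypothetical check: True if the q/k/v projection weights describe one
    # contiguous (3 * E, E) matrix that a single gemm could consume.
    if not (q_w.shape == k_w.shape == v_w.shape):
        return False
    if not (q_w.is_contiguous() and k_w.is_contiguous() and v_w.is_contiguous()):
        return False
    block_bytes = q_w.numel() * q_w.element_size()
    return (q_w.data_ptr() + block_bytes == k_w.data_ptr()
            and k_w.data_ptr() + block_bytes == v_w.data_ptr())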

V_fc = np.concatenate((V_fc, np.repeat(bias_v, V_fc.shape[0], axis=0)), axis=1)
attn_mask = np.concatenate((attn_mask, np.ones([1, 1])), axis=1)
if attn_mask is not None:
    attn_mask = np.concatenate((attn_mask, np.ones([1, 1])), axis=1)
Collaborator

@ngimel ngimel Jun 17, 2019

What are these np functions everywhere and how will they work with cuda tensors?
Edit: sorry, nevermind, did not notice this was specifically for a test.

Contributor Author

They are used to re-implement the multi-head attention computation in the test and verify the results of the nn.MultiheadAttention module.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@cpuhrsch has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

k, v = linear(key, _w, _b).chunk(2, dim=-1)

else:
    _b = in_proj_bias
Contributor

@cpuhrsch cpuhrsch Jul 2, 2019

Since this is getting quite complicated, it could be good to add some comments.

@zhangguanheng66 zhangguanheng66 changed the title from "Update MultiheadAttention module to integrate fairseq version with torch.nn version" to "Update MultiheadAttention module support key/value with different number of features and allow static key/value" Jul 2, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment

@zhangguanheng66 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@zhangguanheng66 merged this pull request in bb0f299.

xzhu1900 pushed a commit to xzhu1900/pytorch that referenced this pull request Jul 5, 2019
…ber of features and allow static key/value (pytorch#21288)

Summary:
The changes include:

1. Allow key/value to have a different number of features from the query. This supports the case where key and value have different feature dimensions.
2. Support three separate proj_weights in addition to a single in_proj_weight. The proj_weights for key and value may have different dimensions from that of the query, so three separate proj_weights are necessary. When key and value have the same dimension as the query, a single large proj_weight is preferred for performance reasons. However, whether a single large weight or three separate weights is faster is a size-dependent decision.
3. Give an option to use static k and v in the multihead_attn operator (see saved_k and saved_v). Those static key/value tensors can now be reused when training the model.
4. Add more test cases to cover the new arguments.

Note: current users should not be affected by the changes.
Pull Request resolved: pytorch#21288

Differential Revision: D15738808

Pulled By: zhangguanheng66

fbshipit-source-id: 288b995787ad55fba374184b3d15b5c6fe9abb5c
@zhangguanheng66 zhangguanheng66 deleted the k_v_diff_dim branch July 12, 2019 17:06