fix #14524 (IndexError when mask prob is too low) #14525
Conversation
Hi @nikvaessen, thank you for catching this bug!
For context: this logic is inherited from the original fairseq implementation, so I think we should leave min_masks as-is for reproducibility.
However, I think we can easily avoid the IndexError itself without breaking anything by simply using a fixed dummy index here:
- # pick first sampled index that will serve as a dummy index to pad vector
- dummy_mask_idx = spec_aug_mask_idx[0]
+ # a dummy index to pad vector
+ dummy_mask_idx = 0
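
For anyone skimming the thread, the failure mode is easy to reproduce in isolation. A minimal sketch (the sampling is stubbed out with an empty array; only the indexing behaviour matters here):

import numpy as np

# with a very low mask_prob a row can end up with zero sampled span starts;
# indexing [0] into the empty array then raises the IndexError from #14524
spec_aug_mask_idx = np.array([], dtype=np.int64)

try:
    dummy_mask_idx = spec_aug_mask_idx[0]  # old behaviour: crashes
except IndexError:
    dummy_mask_idx = 0  # the fixed dummy index proposed above

print(dummy_mask_idx)  # 0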
Adding this test to Wav2Vec2UtilsTest makes sure it's correct:
def test_compute_mask_low_prob(self):
    batch_size = 4
    sequence_length = 60
    mask_prob = 0.0001
    mask_length = 4

    mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
    mask = torch.from_numpy(mask).to(torch_device)

    self.assertEqual(mask.sum(), 0)
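
In case it's useful: the test should be runnable on its own with something like pytest tests/test_modeling_wav2vec2.py -k test_compute_mask_low_prob (assuming it lands in that file).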
@patrickvonplaten correct me if I'm interpreting the dummy index incorrectly, but it seems to work properly with this change.
I've had a deeper look into the actual masking methods in fairseq and in transformers.

Let's take the probabilities for fine-tuning on the full 960h of LibriSpeech in Table 6, and assume a 3 second audio clip. In the time axis, we expect [...]; in the feature axis, we expect [...].

Let's now take a look at the documentation of compute_mask_indices in fairseq and the documentation in transformers. If we look in the config file for fine-tuning in fairseq, we see that they actually use the percentage of the whole sequence which should be masked (50% and 10%) instead of the probabilities quoted in the article. But the documentation is not correct, as [...].

So what do these values actually do? For the feature axis:

import torch as t
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices
batch_size = 10
sequence_length = 768
mask_prob = 0.10
mask_length = 64
mask = _compute_mask_indices(
shape=(batch_size, sequence_length),
mask_prob=mask_prob, # or even lower
mask_length=mask_length,
)
mask = t.from_numpy(mask)
print("number of masked vectors in each batch dimension")
num_masks = t.sum(mask, dim=1)
print(num_masks)
# prints tensor([64, 64, 64, 64, 64, 64, 64, 64, 64, 64])

For the time axis:

import torch as t
from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices
batch_size = 10
sequence_length = 150
mask_prob = 0.5
mask_length = 10
mask = _compute_mask_indices(
shape=(batch_size, sequence_length),
mask_prob=mask_prob, # or even lower
mask_length=mask_length,
)
mask = t.from_numpy(mask)
print("number of masked vectors in each batch dimension")
num_masks = t.sum(mask, dim=1)
print(num_masks)
# prints tensor([72, 76, 66, 73, 70, 59, 65, 63, 68, 75])

In both cases this matches fairseq's sampling rule, which draws int(mask_prob * sequence_length / mask_length + rand()) spans per row: 0.10 * 768 / 64 ≈ 1.2 gives one span of 64 features, and 0.5 * 150 / 10 = 7.5 gives 7 or 8 spans of 10 time steps, with overlapping spans pulling the totals below the nominal 70 or 80.

What about this PR?

First, there should still be a bugfix for when mask_prob is so low that no spans are sampled at all:

import torch as t
from fairseq.data.data_utils import compute_mask_indices
batch_size = 10
sequence_length = 768
mask_prob = 0.0001
mask_length = 64
mask = compute_mask_indices(
shape=(batch_size, sequence_length),
padding_mask=None,
mask_prob=mask_prob, # or even lower
mask_length=mask_length,
)
mask = t.from_numpy(mask)

This raises an IndexError as well, so upstream fairseq has the same problem.
However, instead we should probably warn users, or raise an error, that their chosen mask_prob is so low that no masking will take place. Moreover, we should take one of two options: [...]

If we can agree on one of these two options, I can update this PR.
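
To make the "warn users, or raise an error" idea concrete, a rough sketch of what such a guard could look like (the span-count formula mirrors fairseq's sampling rule; the names and placement are illustrative, not an actual patch):

import warnings

import numpy as np

sequence_length, mask_prob, mask_length = 768, 0.0001, 64

# fairseq-style span count: int(mask_prob * seq_len / mask_length + rand())
num_masked_span = int(mask_prob * sequence_length / mask_length + np.random.rand())

if num_masked_span == 0:
    warnings.warn(
        f"mask_prob={mask_prob} is so low that no spans will be masked "
        f"for a sequence of length {sequence_length}"
    )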
With this commit the meaning of `mask_prob` actually adheres to the probability for each vector to be the start of a masked span of length `mask_length`.
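
To illustrate the interpretation that commit message describes, i.e. mask_prob as the per-vector probability of starting a masked span, a minimal sketch (illustrative only, not the actual commit):

import numpy as np

rng = np.random.default_rng(0)
sequence_length, mask_prob, mask_length = 150, 0.065, 10

# every position is independently a span start with probability mask_prob
starts = rng.random(sequence_length - mask_length + 1) < mask_prob
mask = np.zeros(sequence_length, dtype=bool)
for start in np.flatnonzero(starts):
    mask[start : start + mask_length] = True

print(mask.sum())  # number of masked positions; spans may overlap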
Well, I decided to go for option 1 because: [...]

As for the failing test, I'm not familiar with the other models, so I don't know if it's acceptable to change their implementation of _compute_mask_indices as well. Also, as the default value for [...]
Well, that was obviously not as easy as just running fix-copies. If the changes in this PR are acceptable, I can change the configs for each model and look into fixing their tests as well...
Hey @nikvaessen,

Thanks for the PR and for diving into the complexity of this function. It is indeed not very easy to grasp what's going on there. And you are right, there are two slightly different interpretations of the mask_prob parameter. In the paper, [...]. Note that this function is also used in [...]. Could we instead maybe just simply do the following: [...]

Given that this is such a sensitive function, it would be great if we could, as a first step, just make very minor changes to the function to solve the issue. Let me know what you think! Otherwise, I'm also happy to open a small first PR for this.
…ing percentage`, revert changes to _compute_mask_indices
I've reverted my changes. This PR now updates the documentation so it's (hopefully) more clear how `mask_prob` is actually interpreted.
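
For reference, the relationship the updated docs try to convey boils down to fairseq's sampling rule (paraphrased from the fairseq source; epsilon is a uniform [0, 1) draw):

import numpy as np

sequence_length, mask_prob, mask_length = 150, 0.5, 10
epsilon = np.random.rand()

# number of spans sampled per row; ignoring overlap, roughly
# mask_prob * sequence_length positions end up masked
num_masked_span = int(mask_prob * sequence_length / mask_length + epsilon)
print(num_masked_span)  # 7 or 8 for these values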
Hey @nikvaessen,

Great, thanks a lot for your PR - it looks very clean now. Also thanks a lot for improving the docs - that's super useful for the community!
Cool! Looks good to me now
@anton-l - can you give it a second look and, if it's ok for you, merge the PR? :-)

Ok to merge on my side
LGTM, thanks for the very clear docs @nikvaessen!
What does this PR do?
This fixes the issue described in #14524. It assumes that users are okay with at least some masking if they set the probability to something > 0. I assume this is an okay fix because for time masking min_masks=2.

Who can review?
@patrickvonplaten