fix #14524 (IndexError when mask prob is too low) #14525

Merged
merged 7 commits into huggingface:master from nikvaessen:patch-1 on Dec 2, 2021

Conversation

nikvaessen
Contributor

@nikvaessen nikvaessen commented Nov 25, 2021

What does this PR do?

This fixes the issue described in #14524. It assumes that users are okay with at least some masking if they set the probability to anything > 0. I consider this an acceptable fix because time masking already uses min_masks=2.
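As a rough illustration (a simplified sketch with approximate rounding, not the actual _compute_mask_indices code), this is how a min_masks floor keeps the number of sampled spans from dropping to zero when the probability is tiny:

import numpy as np

# simplified span-count heuristic with a min_masks floor;
# names and rounding only approximate the library's behaviour
def num_masked_spans(mask_prob, sequence_length, mask_length, min_masks=0):
    num_spans = int(mask_prob * sequence_length / mask_length + np.random.rand())
    return max(num_spans, min_masks)

print(num_masked_spans(0.0001, 60, 4))               # almost always 0
print(num_masked_spans(0.0001, 60, 4, min_masks=2))  # 2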

Who can review?

@patrickvonplaten

Member

@anton-l anton-l left a comment


Hi @nikvaessen, thank you for catching this bug!

For context: this logic is inherited from the original fairseq implementation, so I think we should leave min_masks as-is for reproducibility.

However, I think we can easily avoid the IndexError itself without breaking anything by simply using a fixed dummy index here:

-    # pick first sampled index that will serve as a dummy index to pad vector
-    dummy_mask_idx = spec_aug_mask_idx[0]
+    # a dummy index to pad vector
+    dummy_mask_idx = 0 
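To make the failure concrete, here is a small standalone sketch (a simplification, not the helper itself) of why the original line breaks when zero span indices are sampled:

import numpy as np

# when num_masked_span comes out as 0, np.random.choice returns an empty array
num_masked_span = 0
spec_aug_mask_idx = np.random.choice(np.arange(60 - 4), num_masked_span, replace=False)

# dummy_mask_idx = spec_aug_mask_idx[0]  # IndexError: index 0 is out of bounds
dummy_mask_idx = 0                       # the fixed dummy index proposed above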

Adding this test to Wav2Vec2UtilsTest makes sure it's correct:

    def test_compute_mask_low_prob(self):
        batch_size = 4
        sequence_length = 60
        mask_prob = 0.0001
        mask_length = 4

        mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
        mask = torch.from_numpy(mask).to(torch_device)

        self.assertEqual(mask.sum(), 0)

@patrickvonplaten correct me if I'm interpreting the dummy index incorrectly, but seems like it works properly with this change.

@nikvaessen
Contributor Author

nikvaessen commented Nov 27, 2021

I've had a deeper look into the actual masking methods in fairseq and transformers. I think the documentation is vague, and at odds with the description in https://arxiv.org/abs/2006.11477:

During fine-tuning we apply a masking strategy to the feature encoder outputs similar to SpecAugment [41]: we randomly choose a number of starting time steps for which a span of ten subsequent time-steps is replaced with a mask embedding; spans may overlap and we use the same masked time step embedding as during pre-training. We also mask channels by choosing a number of channels as starting indices and then expand each one to cover the subsequent 64 channels. Spans may overlap and the selected channel spans are set to zero value.

Let's take the probabilities of fine-tuning on full 960h of librispeech in Table 6: 0.05 for masking in the time axis and 0.0016 for masking in the channel/feature axis. Let's also assume these probabilities to mean the independent probability for each vector to be the start of a masked span of length n.

Let's assume a 3 second audio clip, for the facebook/wav2vec2-base model this would be a tensor of shape [150 (time), 768 (features)].

In the time axis, we expect 0.05 * 150 = 7.5 vectors to be the start of a mask, for a maximum of 7.5 * mask length of 10 = 75 vectors to be masked if there is no overlap, or in other words 75/150 = 50% of the sequence length.

In the feature axis, we expect 0.0016 * 768 = 1.22 vectors to be the start of a mask, for a maximum of 1.22 * length of 64 = 78 vectors to be masked if no overlap, or in other words, 78/768 ~= 10% of the feature length.
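These back-of-the-envelope numbers are easy to check with plain arithmetic (not library code):

def expected_masking(mask_prob, seq_len, mask_length):
    # paper interpretation: mask_prob is the per-position chance of starting a span
    expected_starts = mask_prob * seq_len
    max_masked_fraction = expected_starts * mask_length / seq_len  # assuming no overlap
    return expected_starts, max_masked_fraction

print(expected_masking(0.05, 150, 10))    # (7.5, 0.5)       -> ~50% of the time axis
print(expected_masking(0.0016, 768, 64))  # (1.2288, 0.1024) -> ~10% of the feature axis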

Let's now take a look at the documentation of compute_mask_indices in fairseq:

mask_prob: probability for each token to be chosen as start of the span to be masked. this will be multiplied by number of timesteps divided by length of mask span to mask approximately this percentage of all elements. however due to overlaps, the actual number will be smaller (unless no_overlap is True)

and the documentation in transformers:

Probability of each feature vector along the time axis to be chosen as the start of the vector span to be masked. Approximately mask_time_prob * sequence_length // mask_time_length feature vectors will be masked along the time axis. This is only relevant if apply_spec_augment is True.

If we look in the config file for fine-tuning in fairseq:

https://github.com/pytorch/fairseq/blob/2380a6e46675ca17bdf22be06bc7c6d138736e59/examples/wav2vec/config/finetuning/base_960h.yaml#L51-L52

we see that they actually use the percentage of the whole sequence which should be masked (50% and 10%) instead of the probabilities quoted in the article.

But the documentation is not correct: 0.50 * 150 // 10 = 7 and 0.10 * 768 // 64 = 1 give the number of vectors which start a span, not the percentage of spanned elements (according to fairseq), nor the number of feature vectors which will be masked (according to transformers).
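A quick check of the documented formula (plain arithmetic, not library code):

print(0.50 * 150 // 10)  # 7.0 -> 7 span starts, i.e. up to 70 masked positions
print(0.10 * 768 // 64)  # 1.0 -> 1 span start,  i.e. up to 64 masked positions

These counts line up with the empirical outputs of _compute_mask_indices shown below: roughly 7 spans of length 10 in the time axis, and exactly 1 span of length 64 in the feature axis.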

So what do the mask_probs need to be in transformers to replicate the behavior in fairseq?

for feature axis:

import torch as t

from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices

batch_size = 10
sequence_length = 768
mask_prob = 0.10
mask_length = 64

mask = _compute_mask_indices(
    shape=(batch_size, sequence_length),
    mask_prob=mask_prob,  # or even lower
    mask_length=mask_length,
)
mask = t.from_numpy(mask)

print("number of masked vectors in each batch dimension")
num_masks = t.sum(mask, dim=1)
print(num_masks)
# prints tensor([64, 64, 64, 64, 64, 64, 64, 64, 64, 64])

for time axis:

import torch as t

from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices

batch_size = 10
sequence_length = 150
mask_prob = 0.5
mask_length = 10

mask = _compute_mask_indices(
    shape=(batch_size, sequence_length),
    mask_prob=mask_prob,  # or even lower
    mask_length=mask_length,
)
mask = t.from_numpy(mask)

print("number of masked vectors in each batch dimension")
num_masks = t.sum(mask, dim=1)
print(num_masks)
# prints tensor([72, 76, 66, 73, 70, 59, 65, 63, 68, 75])

What about this PR?

First, there should still be a bugfix for when mask_prob is too low. Incidentally, this bug also exists in fairseq!

import torch as t

from fairseq.data.data_utils import compute_mask_indices


batch_size = 10
sequence_length = 768
mask_prob = 0.0001
mask_length = 64

mask = compute_mask_indices(
    shape=(batch_size, sequence_length),
    padding_mask=None,
    mask_prob=mask_prob,  # or even lower
    mask_length=mask_length,
)
mask = t.from_numpy(mask)

which raises:

Traceback (most recent call last):
  File "/home/nik/workspace/phd/repo/transformers/playground_fairseq.py", line 12, in <module>
    mask = compute_mask_indices(
  File "/home/nik/workspace/phd/repo/transformers/venv/lib/python3.8/site-packages/fairseq/data/data_utils.py", line 424, in compute_mask_indices
    lengths[0] = min(mask_length, sz - 1)
IndexError: index 0 is out of bounds for axis 0 with size 0

However, instead we should probably warn users, or raise an error, when their chosen mask_prob is too low to ever insert a mask.

Moreover, we should take one of two options:

  • fix mask_prob to mean the independent probability for each vector to be the start of a masked span of length n
  • fix mask_prob to mean the percentage of the sequence which will be masked.

If we can agree on one of these two options, I can update this PR.
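For concreteness, here is how the two interpretations differ numerically for the time axis of the 150-frame example above (plain arithmetic, not library code):

mask_prob, seq_len, mask_length = 0.05, 150, 10

# option 1: mask_prob is the per-vector chance of starting a span
spans_option_1 = mask_prob * seq_len                # 7.5 spans -> up to ~50% masked
# option 2: mask_prob is the target fraction of the sequence to be masked
spans_option_2 = mask_prob * seq_len / mask_length  # 0.75 spans -> ~5% masked

print(spans_option_1, spans_option_2)  # 7.5 0.75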

With this commit the meaning of `mask_prob` actually adheres to the probability for each vector to be the start of a masked span.
@nikvaessen
Contributor Author

nikvaessen commented Nov 28, 2021

Well, I decided to go for option 1 because:

  • It requires fewer changes to the code
  • it matches the wav2vec2 article description, which I expect people to read sooner than the fairseq code
  • it's easier to reason about a mask_length if mask_prob means option 1.
  • we don't need to change the configuration name in the code, preventing API breakage (assuming mask_prob would otherwise have to be renamed to something like mask_percentage).
  • the default value of mask_time_prob now matches the settings for finetuning on 960h in the article.

As for the failing test, I'm not familiar with the other models, so I don't know if it's acceptable to change their implementation of _compute_mask_indices as well.

Also, as the default value for mask_time_prob is 0.05, why is the mask_feature_prob defaulted to 0 instead of 0.0016 or 0.0008?

@nikvaessen
Contributor Author

Well, that was obviously not as easy as just fix-copy. If the changes in this PR are acceptable I can change the configs for each model, and look into fixing their tests as well...

@patrickvonplaten
Contributor

Hey @nikvaessen,

Thanks for the PR and for diving into the complexity of this function. It is indeed not very easy to grasp what's going on there. And you are right, there are two slightly different interpretations of the mask_prob value.

In the paper, mask_prob is used to state the probability that each token is the start index of a mask span. In the code, however, mask_prob is rather used as an upper bound on the overall percentage of possibly masked vectors.

Note that this function is also used in pretraining, and it has been shown that Wav2Vec2 pretraining is extremely sensitive to how those mask_time_indices are generated. This PR sadly includes too many changes for a fix for a low mask probability.

Could we instead maybe simply do the following:

  • a) either add a min_mask function argument to _compute_mask_indices to prevent this error
  • b) I'm fine with computing/catching when the probability is too low and then simply setting the whole mask to 0

Given that this is such a sensitive function, it would be great if we could, in a first step, just make very minor changes to the function to solve the issue. Let me know what you think! Otherwise, I'm also happy to open a small first PR for this.

…ing percentage`, revert changes to _compute_mask_indices
@nikvaessen
Contributor Author

nikvaessen commented Dec 1, 2021

I've reverted my changes. This PR now updates the documentation so it's (hopefully) clearer how mask_prob should be interpreted. The IndexError described in the linked issue should now be fixed by returning the zero mask when max_num_masked_span is 0. This is possible because we only compute epsilon once. It might be more elegant to compute epsilon separately for every batch dimension, but as you indicated that pre-training is very sensitive to the masks, I thought it better to leave it as is.
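A minimal sketch of the resulting behavior (a simplified stand-in with made-up internals, not the actual _compute_mask_indices): when the single epsilon draw leaves max_num_masked_span at 0, the all-zero mask is returned before any indices are sampled, so the IndexError can no longer occur.

import numpy as np

def compute_mask_sketch(batch_size, sequence_length, mask_prob, mask_length):
    mask = np.zeros((batch_size, sequence_length), dtype=bool)

    epsilon = np.random.rand()  # drawn once, as in the existing function
    max_num_masked_span = int(mask_prob * sequence_length / mask_length + epsilon)

    if max_num_masked_span == 0:
        return mask  # too few spans to mask anything: return the zero mask

    for b in range(batch_size):
        starts = np.random.choice(
            np.arange(sequence_length - mask_length), max_num_masked_span, replace=False
        )
        for start in starts:
            mask[b, start : start + mask_length] = True
    return mask

print(compute_mask_sketch(4, 60, 0.0001, 4).sum())  # 0 in almost every run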

@patrickvonplaten
Contributor

Hey @nikvaessen,

Great thanks a lot for your PR, it looks very clean now. Also thanks a lot for improving the docs - that's super useful for the community!

Contributor

@patrickvonplaten patrickvonplaten left a comment


Cool! Looks good to me now

@patrickvonplaten
Contributor

@anton-l - can you give it a second look and if ok for you merge the PR? :-) Ok to merge on my side

Member

@anton-l anton-l left a comment


LGTM, thanks for the very clear docs @nikvaessen!

@anton-l anton-l merged commit 6645eb6 into huggingface:master Dec 2, 2021
@nikvaessen nikvaessen deleted the patch-1 branch December 2, 2021 14:52