Add a handle_buffers option for EMAHandler #2592
Conversation
@sandylaker thanks for the quick PR! As for the problem with integral params in buffers, does it mean that PyTorch should have a similar issue here: https://github.com/pytorch/pytorch/blob/080cf84bed46c6c118c37fc2fa6fbd484fd9b4cd/torch/optim/swa_utils.py#L124-L132 ?
Yeah, they should, but the default usage for the SWA model is to update the batchnorm statistics after training rather than during it, via https://github.com/pytorch/pytorch/blob/080cf84bed46c6c118c37fc2fa6fbd484fd9b4cd/torch/optim/swa_utils.py#L136 . I would also hypothesize that most code and examples set the model to .train() before transfer, so the problem is less likely to occur. Actually, I once trained a common mean-teacher recipe and checked the absolute difference between the online (student) model and the EMA model.
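For context, a minimal sketch of the standard SWA workflow being referred to, where BatchNorm statistics are recomputed once after training via `torch.optim.swa_utils.update_bn` (the `model`, `num_epochs`, `train_one_epoch`, and `train_loader` names here are placeholders, not from the thread):

```python
import torch
from torch.optim.swa_utils import AveragedModel, update_bn

swa_model = AveragedModel(model)  # wraps `model`; buffers are copied, not averaged

for epoch in range(num_epochs):         # placeholder training loop
    train_one_epoch(model)              # hypothetical helper
    swa_model.update_parameters(model)  # averages parameters only

# Recompute BatchNorm running statistics in a single pass after training,
# instead of trying to average them during training:
update_bn(train_loader, swa_model)
```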
@vfdev-5 I did not test the integral-buffer case, but consider:

```python
>>> num_batches_tracked = torch.tensor(0, dtype=torch.int64)
>>> num_batches_tracked.copy_(num_batches_tracked * 0.9998 + 1 * 0.0002)
tensor(0)
```

Here 0 * 0.9998 + 1 * 0.0002 = 0.0002, which truncates back to 0 when written into an int64 tensor, so the counter never increases. I think the rounding error will be more severe when the online value is small. In the current implementation, the …
@sandylaker if you check the "Solutions" section of #2590 again, it looks like RicherMans already tried updating floating-point buffers and copying integral ones (see his code snippet). For this PR I think we could add an arg with 3 options (sketched below):
1) "copy" = keep the current (copy) behaviour,
2) "update_buffers" = update buffers according to RicherMans's option 1, and
3) "ema_train" = RicherMans's option 2.
What do you think?
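To make the three options concrete, here is a minimal sketch of what the EMA update step could look like under each of them (option names are taken from this discussion; the merged implementation may differ):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, momentum, handle_buffers="copy"):
    # Parameters are always EMA-updated, regardless of the option.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(1.0 - momentum).add_(p, alpha=momentum)

    if handle_buffers == "copy":
        # 1) "copy": keep the current behaviour and copy all buffers
        #    verbatim, integral ones included.
        for ema_b, b in zip(ema_model.buffers(), model.buffers()):
            ema_b.copy_(b)
    elif handle_buffers == "update_buffers":
        # 2) "update_buffers": EMA-update floating-point buffers
        #    (running_mean, running_var) but copy integral buffers such
        #    as num_batches_tracked, avoiding the rounding-to-zero issue
        #    shown above.
        for ema_b, b in zip(ema_model.buffers(), model.buffers()):
            if ema_b.dtype.is_floating_point:
                ema_b.mul_(1.0 - momentum).add_(b, alpha=momentum)
            else:
                ema_b.copy_(b)
    # 3) "ema_train": buffers are neither copied nor averaged; the EMA
    #    model is instead kept in train() mode so its BatchNorm layers
    #    accumulate their own running statistics.
```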
I actually did already check all the proposed methods during my investigations. To be more precise about my experiments: I do sound event classification for the DCASE 2022 task, where the baseline uses a mean-teacher approach that I have reimplemented using EMAHandler. The results are as follows:

[results table from the original comment]

So, as I originally stated, the momentum update of buffers such as the running mean and variance clearly works better than the simple copying mechanism. Of course, this is a mean-teacher learning case, and my results might not be representative of other cases where one might want to "freeze" the buffers. What do you guys think?
@RicherMans thanks for sharing the details! I think we could provide all 3 options so that, depending on the use case, users can pick the appropriate configuration (as suggested above: #2592 (comment)). We have to ensure that the 3rd option …
@sandylaker can we move forward with this PR and implement the 3 options described in #2592 (comment)? Thank you
@vfdev-5 Hi, sorry for the late reply; I have been busy during the week. I will implement it this weekend.
Changed the title: use_buffers option for EMAHandler → handle_buffers option for EMAHandler
vfdev-5 left a comment:
Looks great, thanks a lot @sandylaker!
@RicherMans could you please check whether this PR solves the issue you had?
I merged this PR; if we need more changes, we can do that in a follow-up PR.
Sorry @vfdev-5 for my late response. Thanks, guys, for the good work! (Even though I hope I can also commit something myself one day.)
That would be awesome! By the way, if you need some help getting started, feel free to join our Discord: https://pytorch-ignite.ai/chat
Fixes #2590
Description:
Add a handle_buffers option for EMAHandler.

cc @vfdev-5 @RicherMans
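As a rough usage sketch (assuming a `model` and a `trainer` engine already exist; the exact option names should be checked against the merged API):

```python
from ignite.engine import Events
from ignite.handlers import EMAHandler

# Update the EMA model every iteration; handle_buffers selects how the
# buffers are treated ("copy", "update_buffers", or "ema_train" in the
# naming used in this discussion).
ema_handler = EMAHandler(model, momentum=0.0002, handle_buffers="update_buffers")
ema_handler.attach(trainer, name="ema_momentum", event=Events.ITERATION_COMPLETED)

ema_model = ema_handler.ema_model  # use this model for evaluation
```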
Check list: