Gradient accumulation is untested #3036
Description
I often run into the same problem as #2532: a layer's Backward must zero-initialize (i.e., overwrite) the bottom diff, but must accumulate into the parameter diffs rather than overwriting them. This asymmetry is unintuitive, but acceptable as long as it is both documented and tested. The problem is that there are currently no tests checking each layer for this behavior.
If you are opposed to fixing the test as in tnarihi@7d45526 (the "+1, -1" trick), an alternative would be to add a function to every layer indicating whether it supports gradient accumulation (default false). The gradient checker would then apply the "+1, -1" trick only to layers that claim to support it, and when iter_size > 1 the net would check all of its layers and raise an exception if any layer has parameters but does not support gradient accumulation.
I'm opening this issue because I think it needs to be addressed in one way or another.