
Conversation

@n-poulsen
Contributor

This pull request addresses how learning rate schedulers are handled when resuming training. Currently, schedulers are simply rebuilt from the config file, so when continuing to train from an existing snapshot, the scheduler restarts at epoch 0 and the learning rate is not adapted (as mentioned in #2784).
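For illustration, a minimal sketch with a plain PyTorch StepLR (not DeepLabCut's actual scheduler setup) of what happens when the scheduler is rebuilt from the config on resume:

```python
import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# First run: the learning rate decays twice over 20 epochs.
for _ in range(20):
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())  # ~0.001 after two decays

# Resuming by rebuilding the scheduler from the config: it restarts at epoch 0,
# so training continues from the base learning rate instead of the decayed one.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
print(scheduler.get_last_lr())  # [0.1]
```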

In this pull request, the code is updated to:

  • save the scheduler state dicts in snapshots
  • when resuming training, try to load the state dict for the scheduler
    • if successful, set the optimizer's learning rate to the last learning rate from the scheduler
  • a load_scheduler_state_dict key is added to the runner configuration
    • the default is True: loading a snapshot with a saved scheduler to continue training will load the scheduler state dict
    • however, users might edit the scheduler's parameters and want the updated schedule to be used when continuing training
    • in that case, they need to set load_scheduler_state_dict: false so the state dict doesn't overwrite their edited parameters
    • the self.starting_epoch value is used to set last_epoch in the scheduler, so the learning rate schedule matches the epochs printed in the logs (a minimal sketch of this flow is shown after the list)
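
A minimal sketch of that flow in plain PyTorch (the snapshot layout and the save_snapshot / resume_from_snapshot helper names are illustrative, not the actual DeepLabCut runner code):

```python
import torch


def save_snapshot(path, model, optimizer, scheduler, epoch):
    """Persist model, optimizer and scheduler state so training can resume."""
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),  # the new piece: scheduler state is saved too
        },
        path,
    )


def resume_from_snapshot(path, model, optimizer, scheduler, load_scheduler_state_dict=True):
    """Restore training state; optionally skip the scheduler state dict."""
    snapshot = torch.load(path)
    model.load_state_dict(snapshot["model"])
    optimizer.load_state_dict(snapshot["optimizer"])

    starting_epoch = snapshot["epoch"]
    if load_scheduler_state_dict and "scheduler" in snapshot:
        scheduler.load_state_dict(snapshot["scheduler"])
        # Set the optimizer's learning rate to the scheduler's last learning rate,
        # so the first resumed epoch continues where the previous run stopped.
        for group, lr in zip(optimizer.param_groups, scheduler.get_last_lr()):
            group["lr"] = lr
    else:
        # load_scheduler_state_dict: false -> keep the (possibly edited) scheduler
        # parameters, but align the schedule with the epochs printed in the logs.
        scheduler.last_epoch = starting_epoch
    return starting_epoch
```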

Tests were added to validate that the expected learning rates are applied.
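
A hypothetical check in that spirit, reusing the save_snapshot / resume_from_snapshot sketch above (not the PR's actual test code): the learning rate after resuming should match that of an uninterrupted run at the same epoch.

```python
import torch
from torch.optim.lr_scheduler import StepLR


def test_resumed_lr_matches_uninterrupted_run(tmp_path):
    def make_run():
        model = torch.nn.Linear(4, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        return model, optimizer, StepLR(optimizer, step_size=10, gamma=0.1)

    # Uninterrupted reference run: 15 epochs.
    _, _, ref_scheduler = make_run()
    for _ in range(15):
        ref_scheduler.step()

    # Interrupted run: 10 epochs, snapshot, then resume for 5 more epochs.
    model, optimizer, scheduler = make_run()
    for _ in range(10):
        scheduler.step()
    save_snapshot(tmp_path / "snap.pt", model, optimizer, scheduler, epoch=10)

    model, optimizer, scheduler = make_run()
    resume_from_snapshot(tmp_path / "snap.pt", model, optimizer, scheduler)
    for _ in range(5):
        scheduler.step()

    assert scheduler.get_last_lr() == ref_scheduler.get_last_lr()
```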

@n-poulsen added the enhancement (New feature or request) and DLC3.0🔥 labels on Nov 14, 2024
@n-poulsen merged commit b9a80bd into pytorch_dlc on Nov 21, 2024
@n-poulsen deleted the niels/save_scheduler_state_dict branch on November 21, 2024 at 16:09
xiu-cs pushed a commit to xiu-cs/DeepLabCut that referenced this pull request on Nov 25, 2024