
1757 Enhance CheckpointLoader to restore max_epochs #1775

Merged
ericspod merged 11 commits into Project-MONAI:master from Nic-Ma:1757-enhance-checkpointloader on Mar 16, 2021

Conversation

@Nic-Ma
Contributor

@Nic-Ma Nic-Ma commented Mar 16, 2021

Fixes #1757.

Description

This PR fixes the issue where loading the engine's own state_dict through CheckpointLoader overwrites the current max_epochs value.
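The behaviour described above can be illustrated with a minimal sketch. It is not the MONAI or ignite implementation: `StubEngine`, `load_keeping_max_epochs`, and the checkpoint dictionary are hypothetical stand-ins, and the state layout is simplified to just `epoch` and `max_epochs`.

```python
from types import SimpleNamespace


class StubEngine:
    """Hypothetical stand-in for an engine; not MONAI or ignite code."""

    def __init__(self, max_epochs):
        self.state = SimpleNamespace(epoch=0, max_epochs=max_epochs)

    def load_state_dict(self, state_dict):
        # Mimics the reported problem: loading overwrites max_epochs.
        self.state.epoch = state_dict["epoch"]
        self.state.max_epochs = state_dict["max_epochs"]


def load_keeping_max_epochs(engine, state_dict):
    """Load the engine state, then put back the max_epochs set before loading."""
    prior_max_epochs = engine.state.max_epochs
    engine.load_state_dict(state_dict)
    engine.state.max_epochs = prior_max_epochs


# Made-up checkpoint contents, for illustration only.
checkpoint = {"epoch": 5, "max_epochs": 5}

naive = StubEngine(max_epochs=10)
naive.load_state_dict(checkpoint)  # max_epochs is overwritten by the checkpoint

fixed = StubEngine(max_epochs=10)
load_keeping_max_epochs(fixed, checkpoint)  # max_epochs keeps the value set at construction
```

With the value restored, an engine constructed with a larger `max_epochs` can continue training from the checkpoint's epoch instead of stopping at the checkpoint's own limit.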

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2021

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2021

@vfdev-5, could you please help review this PR?

Thanks in advance.

Member

@vfdev-5 vfdev-5 left a comment


Looks good to me!

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2021

/black

@danieltudosiu
Contributor

The current error message is ambiguous and does not help the user debug without diving into the MONAI codebase. I would make the error message more descriptive by changing it from:

raise ValueError("current max epochs setting is smaller than epoch count in the checkpoint.")

to

raise ValueError(f"The epoch count found in the checkpoint ({engine.state.epoch}) is larger than the engine.state.max_epochs of the constructed engine prior loading the checkpoint ({prior_max_epochs}). If you want to further train this checkpoint please construct the trainer with a max_epochs bigger than the epoch count of the checkpoint and bigger or equal with its max_epochs. If you want to use it for inference please use the same max_epochs you used during training.")

It might be too wordy, so please trim it down if you feel it needs it.

@ericspod
Member

raise ValueError(f"The epoch count found in the checkpoint ({engine.state.epoch}) is larger than the engine.state.max_epochs of the constructed engine prior loading the checkpoint ({prior_max_epochs}). If you want to further train this checkpoint please construct the trainer with a max_epochs bigger than the epoch count of the checkpoint and bigger or equal with its max_epochs. If you want to use it for inference please use the same max_epochs you used during training.")

A little more concise:

ValueError(f"Epoch count in checkpoint ({engine.state.epoch}) is larger than the `engine.state.max_epochs` of engine ({prior_max_epochs}). To further train from checkpoint construct trainer with `max_epochs` larger than checkpoint's epoch count. To use for inference `max_epochs` must be the same as checkpoint.")
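As a rough illustration of where such a message would be raised, here is a self-contained sketch of the check. The function name `check_checkpoint_epoch` and its parameters are hypothetical, and the message uses the wording proposed in this comment, which may differ from the text that was finally merged.

```python
def check_checkpoint_epoch(checkpoint_epoch, prior_max_epochs):
    """Raise if the checkpoint is already past the engine's configured max_epochs.

    Hypothetical helper for illustration; not part of MONAI.
    """
    if checkpoint_epoch > prior_max_epochs:
        raise ValueError(
            f"Epoch count in checkpoint ({checkpoint_epoch}) is larger than "
            f"the `engine.state.max_epochs` of engine ({prior_max_epochs}). "
            "To further train from checkpoint construct trainer with "
            "`max_epochs` larger than checkpoint's epoch count."
        )


# A checkpoint saved at epoch 5 loads into an engine configured for 10 epochs.
check_checkpoint_epoch(checkpoint_epoch=5, prior_max_epochs=10)

# A checkpoint saved at epoch 10 is rejected by an engine configured for 5 epochs.
try:
    check_checkpoint_epoch(checkpoint_epoch=10, prior_max_epochs=5)
except ValueError as err:
    message = str(err)
```

Putting both numbers in the message lets the user see at a glance which side of the comparison needs to change.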

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2021

Hi @ericspod ,

I updated the error message in the PR and slightly modified the inference part.
Could you please help review it again?

Thanks.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2021

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2021

/black

@ericspod ericspod enabled auto-merge (squash) March 16, 2021 15:27
@ericspod ericspod merged commit 050efb7 into Project-MONAI:master Mar 16, 2021
@Nic-Ma Nic-Ma deleted the 1757-enhance-checkpointloader branch July 2, 2021 23:38
Development

Successfully merging this pull request may close these issues.

Continue training past checkpoint's max epochs

4 participants