Right now our checkpointing exists and works, but we have been thinking of rethinking it around a set of particular use cases / goals it should cover.
Here are some things that need to be covered by the new checkpointing:
- Save on GPU and load on CPU, i.e. separate the type and location of the saved Tensors and allow remapping locations at load time
- If one saves a model, changes the call operator of the class, and loads the model back, the loaded model no longer does the same thing as before. Use Python's inspect API to save the class's current source code with the model, and warn if the class's source differs from the saved source at load time.
- Make the endianness and long-size of the checkpoint consistent, so checkpoints work across all platforms
- Allow one to get the parameters of a model as a super simple name / tensor dictionary. This decouples versioning of the Container class from the parameters. It also allows one to save the weights from one model and load them into another, as the keys are simply name strings for each parameter. For example:
{ 'conv1.weight' : torch.FloatTensor(...), 'resblock1.conv3.bias' : torch.FloatTensor(...), ...}
- dumping trainer / optimizer state
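The location-remapping idea in the first bullet can be sketched independently of the serialization format. A minimal sketch; the `remap_location` name and the device-tag strings are assumptions for illustration, not an existing API:

```python
def remap_location(saved_location, map_location=None):
    """Resolve the device a saved tensor should be restored to.

    `saved_location` is the device tag recorded at save time (e.g. 'cuda:0');
    `map_location` maps saved tags to target tags, so a checkpoint written
    on a GPU can be materialized on the CPU at load time.
    """
    if map_location is None:
        return saved_location
    return map_location.get(saved_location, saved_location)

# A GPU-saved tensor remapped onto the CPU:
# remap_location('cuda:0', {'cuda:0': 'cpu'}) -> 'cpu'
```

Keeping the mapping as data (rather than baking the device into the serialized bytes) is what makes the save-time and load-time locations independent.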
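The source-change warning from the second bullet can be built directly on `inspect.getsource`. A sketch under the assumption that the class is defined in Python source that `inspect` can see; the helper names are made up for illustration:

```python
import inspect
import warnings

def snapshot_source(obj):
    # Record the source of the object's class at save time.
    return inspect.getsource(type(obj))

def source_matches(saved_source, obj):
    # Compare the saved source against the class definition visible now,
    # warning when they diverge (the loaded model may behave differently
    # than it did when the checkpoint was written).
    current = inspect.getsource(type(obj))
    if current != saved_source:
        warnings.warn("source of %s changed since the checkpoint was written"
                      % type(obj).__name__)
    return current == saved_source
```

For example, snapshotting a `collections.UserDict` instance and re-checking it against an edited class body would trigger the warning.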
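The portability bullet amounts to fixing the byte order and integer width of the on-disk format explicitly, instead of using the host's native `long`. A sketch using Python's `struct` module; the header layout here is purely illustrative:

```python
import struct

# '<' forces little-endian byte order and 'q' a fixed 8-byte signed
# integer, so the header reads back identically on big-endian and
# little-endian hosts regardless of the platform's long size.
HEADER = struct.Struct('<qq')  # (num_elements, element_size)

def pack_header(num_elements, element_size):
    return HEADER.pack(num_elements, element_size)

def unpack_header(raw):
    return HEADER.unpack(raw)
```

With native formats (`'@lq'` etc.) the same checkpoint would deserialize differently across platforms; pinning the format string removes that ambiguity.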
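The flat name / tensor dictionary in the last bullet falls out of recursively prefixing parameter names with the names of the enclosing modules. A minimal stand-in sketch (plain lists play the role of tensors; the class and method names are assumptions):

```python
class Module:
    """Minimal container: holds named parameters and named submodules."""
    def __init__(self):
        self._parameters = {}  # name -> tensor
        self._modules = {}     # name -> child Module

    def state_dict(self, prefix=''):
        # Flatten into {'conv1.weight': ..., 'resblock1.conv3.bias': ...}.
        out = {}
        for name, param in self._parameters.items():
            out[prefix + name] = param
        for name, child in self._modules.items():
            out.update(child.state_dict(prefix + name + '.'))
        return out

    def load_state_dict(self, flat, prefix=''):
        # Keys are plain strings, so weights saved from one model can be
        # loaded into any other model that uses the same parameter names.
        for name in self._parameters:
            key = prefix + name
            if key in flat:
                self._parameters[name] = flat[key]
        for name, child in self._modules.items():
            child.load_state_dict(flat, prefix + name + '.')
```

Because the dictionary is decoupled from the Container class itself, versioning the class and versioning the weights become separate problems.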