
rethink checkpointing #86

@soumith


Right now our checkpointing exists and works, but we've been thinking of rethinking it around particular use cases and goals that it should cover.

Here are some things that need to be covered by the new checkpointing:

  • save to GPU and load from CPU, i.e. separate the type and location of the saved Tensors and allow remapping locations at load time
  • If one saves a model, changes the class's call operator, and loads the model back, the model no longer does the same thing as before. Use Python's inspect API to save the class's current source code with the model, and warn if the loaded source differs from the class's source.
  • Make the endianness and long size of the checkpoint consistent, so that checkpoints work across all platforms
  • Allow one to get the parameters of a model as a super simple name / tensor dictionary. This decouples the problem of versioning the Container class from the parameters. It also allows one to save the weights from a model and load them into another model, as the keys here are simply name strings for each parameter. For example:
    { 'conv1.weight' : torch.FloatTensor(...), 'resblock1.conv3.bias' : torch.FloatTensor(...), ...}
  • dumping trainer / optimizer state
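The location-remapping idea in the first bullet could be sketched roughly like this. This is a hypothetical design sketch, not the real serialization code: `save_tensor`, `load_tensor`, and the record layout are made-up names, but they show the key point that the storage location is recorded separately from the data and can be remapped by a caller-supplied function at load time.

```python
def save_tensor(data, type_tag, location):
    """Sketch: record the data, its type, and its device location as
    separate fields, instead of baking the location into the payload."""
    return {"data": data, "type": type_tag, "location": location}

def load_tensor(record, map_location=None):
    """Sketch: map_location is an optional callable that rewrites the
    saved location, e.g. lambda loc: 'cpu' to load GPU tensors on CPU."""
    location = record["location"]
    if map_location is not None:
        location = map_location(location)
    return {"data": record["data"], "type": record["type"],
            "location": location}

# Saved on a GPU, loaded back onto the CPU:
record = save_tensor([1.0, 2.0], "float", "cuda:0")
loaded = load_tensor(record, map_location=lambda loc: "cpu")
print(loaded["location"])  # cpu
```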
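The source-tracking idea can be sketched with the stdlib `inspect` and `warnings` modules. `save_checkpoint` and `load_checkpoint` are hypothetical names, and a real implementation would of course serialize the weights as well; the point is only that `inspect.getsource` lets us bundle the class source and compare it at load time.

```python
import inspect
import warnings

def save_checkpoint(model_class, weights):
    """Sketch: bundle the class's current source code with the weights."""
    return {"source": inspect.getsource(model_class), "weights": weights}

def load_checkpoint(checkpoint, model_class):
    """Sketch: warn if the class source changed since saving."""
    current = inspect.getsource(model_class)
    if current != checkpoint["source"]:
        warnings.warn(
            "source code of %s differs from the checkpointed version; "
            "the loaded model may not behave as before" % model_class.__name__
        )
    return checkpoint["weights"]
```

If the class is unchanged, loading is silent; if its body was edited since saving, the user gets a warning instead of silently different behavior.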
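The flat name / tensor dictionary above can be sketched in plain Python, independent of any torch internals. The `Module` class, its `_parameters` / `_modules` fields, and the list placeholders standing in for tensors are all hypothetical; the sketch only shows how a recursive walk produces dotted keys like 'conv1.weight'.

```python
class Module:
    """Hypothetical container stand-in: holds parameters and children."""
    def __init__(self):
        self._parameters = {}   # parameter name -> tensor(-like) object
        self._modules = {}      # child name -> Module

    def state_dict(self, prefix=""):
        """Flatten the module tree into a {'dotted.name': tensor} dict."""
        flat = {}
        for name, param in self._parameters.items():
            flat[prefix + name] = param
        for name, child in self._modules.items():
            flat.update(child.state_dict(prefix=prefix + name + "."))
        return flat

# Build a tiny model tree: conv1 at the top, conv3 inside resblock1.
model = Module()
conv1 = Module()
conv1._parameters["weight"] = [1.0, 2.0]   # placeholder for a FloatTensor
conv3 = Module()
conv3._parameters["bias"] = [0.5]
resblock1 = Module()
resblock1._modules["conv3"] = conv3
model._modules["conv1"] = conv1
model._modules["resblock1"] = resblock1

print(sorted(model.state_dict()))  # ['conv1.weight', 'resblock1.conv3.bias']
```

Because the keys are plain strings, weights saved from one model can be loaded into any other model whose parameters happen to share the same names, with no dependency on the Container class's own versioning.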
