@ezyang (Contributor) commented Oct 5, 2018

Signed-off-by: Edward Z. Yang [email protected]

CC @deepakn94

@pietern (Contributor) commented Oct 5, 2018

Wow, nice catch. Also, should we follow up by taking an exclusive lock on the file when writing a checkpoint?
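For illustration, here is a minimal sketch of what taking an exclusive lock around a checkpoint write could look like. It is not code from this pull request: the function name `save_checkpoint_locked` and the `.lock` sidecar file are hypothetical, `pickle` stands in for whatever serializer the training script uses, and `fcntl.flock` is POSIX-only and advisory, so it only helps if every writer takes the same lock. Its behavior on network file systems varies and should be checked separately.

```python
import fcntl
import os
import pickle


def save_checkpoint_locked(state, path):
    """Write `state` to `path` while holding an exclusive advisory lock.

    Hypothetical helper: the sidecar lock file name is an assumption.
    """
    lock_path = path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock on lock_path.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                # Ask the OS to push the data to disk before unlocking.
                os.fsync(f.fileno())
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A separate lock file is used here so that opening the checkpoint for writing (which truncates it) only happens after the lock is held.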

@ezyang (Contributor, Author) commented Oct 5, 2018

It would be good if there were an official, well-known, non-footgun way of saving checkpoints in this context.
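No official helper is named in this thread. One commonly used pattern, sketched here under stated assumptions, is to write the checkpoint to a temporary file in the same directory and then rename it over the target, so readers never observe a half-written file. The name `save_checkpoint_atomic` is hypothetical, and `pickle` stands in for the real serializer. `os.replace` is atomic on POSIX only when the source and destination are on the same file system, which is why the temporary file is created next to the target.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomic(state, path):
    """Write `state` to a temp file, then atomically rename it to `path`.

    Hypothetical helper, not part of this pull request.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename
    # stays within a single file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Replaces any existing checkpoint in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does not by itself stop several processes from each writing their own copy; it only ensures that whichever rename lands last leaves a complete file.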

@deepakn94 commented

I think it's hard to completely eliminate "bad" behavior. For example, people could checkpoint a model in some local file, and then move the checkpoint to a network file system (https://github.com/stanford-futuredata/pytorch-distributed/blob/master/training/train_imagenet_nv.py#L408).

But locks on files might be a good thing to do anyway.

@facebook-github-bot (Contributor) left a comment

ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
