Labels: module: cuda (Related to torch.cuda, and CUDA support in general), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🚀 The feature, motivation and pitch
Especially during hyperparameter optimization (HPO), exceptions like CUDA out-of-memory (OOM) errors can occur.
I'm looking for a way to recover from OOM exceptions and would like to propose an additional force parameter for torch.cuda.empty_cache() that forces PyTorch to release all cached memory, even if, due to a memory leak, some allocations remain.
Optionally, a function like torch.cuda.reset() would obviously work as well.
The currently suggested combination of gc.collect() and torch.cuda.empty_cache() is not reliable enough to restore the initial state.
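For reference, this is roughly the recovery pattern I've been trying; the handler structure and names like run_trial are illustrative, not PyTorch APIs:

```python
import gc
import torch

def run_trial(trial_fn, *args, **kwargs):
    # Typical HPO recovery attempt: catch the OOM, drop references, collect
    # garbage, and release the caching allocator's cached blocks. In practice
    # this does not reliably restore the initial memory state if anything
    # (e.g. a tensor held by a traceback or a logging list) still leaks.
    try:
        return trial_fn(*args, **kwargs)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        gc.collect()
        torch.cuda.empty_cache()
        return None  # tell the HPO loop to skip or retry this trial
```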
Alternatives
Completely restarting the Python kernel releases all CUDA memory, but is not practical during an HPO run. Running each trial in its own subprocess can approximate this, as sketched below.
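A sketch of that per-trial-subprocess workaround, under the assumption that trial_fn is a picklable top-level function (trial_fn and params are placeholder names):

```python
import concurrent.futures
import multiprocessing as mp

def run_trial_isolated(trial_fn, params):
    # Run the trial in a fresh "spawn" subprocess so all of its CUDA memory
    # is released when the process exits -- effectively a per-trial kernel
    # restart without stopping the HPO driver. "spawn" (rather than the
    # default "fork" on Linux) avoids forking a CUDA-initialized parent,
    # which CUDA does not support.
    ctx = mp.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(max_workers=1,
                                                mp_context=ctx) as pool:
        return pool.submit(trial_fn, params).result()
```

The downside is the per-trial startup cost (process spawn, CUDA context creation, module imports), which is why an in-process torch.cuda.reset() would still be preferable.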
Additional context
Suggestions on how to properly track down memory leaks and solve my core problem are appreciated. The sketch below shows the kind of diagnostics I've been using so far.
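These helpers rely only on existing APIs (torch.cuda.memory_allocated(), torch.cuda.memory_reserved(), gc.get_objects()); the helper names themselves are ad hoc, not part of PyTorch:

```python
import gc
import torch

def report_cuda_memory(tag=""):
    # allocated = memory in use by live tensors; reserved = memory held by
    # the caching allocator (the part empty_cache() can return to the driver).
    print(f"[{tag}] allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")

def live_cuda_tensors():
    # Enumerate CUDA tensors the garbage collector still tracks; anything
    # listed here after a trial has finished is a candidate leak.
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                yield type(obj), tuple(obj.shape)
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            pass
```

torch.cuda.memory_summary() is also useful for a more detailed allocator breakdown, but none of these tell me how to force the allocator back to its initial state.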
cc @ngimel