Add safe_globals to resume training on PyTorch 2.6 #34632
Conversation
Force-pushed 2cee855 → fa62472
@muellerzr, @SunMarc, @ArthurZucker: can you please help comment on this PR? See issue #34631 for details.
SunMarc
left a comment
Nice! Thanks for adding this! Left a comment.
I am getting … when running … against this PR.
@ydshieh: this might be due to the numpy version. The dtypes module was added in 1.25 according to https://numpy.org/doc/2.1/reference/routines.dtypes.html#module-numpy.dtypes. Locally I have 1.26.4. Which version do you have? I will work on using the context manager since there is alignment on that, and also tune the list per numpy version.
On our CI runner, I get …
The numpy GLOBALs for dtypes that need to be allowlisted might need an if statement depending on whether the version is < 1.25 or not; there is some documentation on this here: https://pytorch.org/docs/main/notes/serialization.html#troubleshooting-weights-only
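For illustration, a minimal sketch of such a version-gated allowlist; the variable names and the exact dtype list here are assumptions, not the PR's final code:

import numpy as np
import torch
from packaging import version

# Types commonly hit when unpickling numpy scalars from a checkpoint.
np_globals = [np.core.multiarray._reconstruct, np.ndarray, np.dtype]
# np.dtypes.* classes (e.g. np.dtypes.UInt32DType) only exist since numpy 1.25.
if version.parse(np.__version__) >= version.parse("1.25.0"):
    np_globals.append(np.dtypes.UInt32DType)

# Requires PyTorch >= 2.4.
torch.serialization.add_safe_globals(np_globals)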
ArthurZucker
left a comment
cc @muellerzr if you can have a look as well!
src/transformers/trainer.py
Outdated
We could have a SAFE_TRANSFORMERS_GLOBAL with these, no? That way people can easily update them?
TBH I prefer the context manager but want to have as little duplication as possible!
I found that calling torch.serialization.add_safe_globals() still works to add additional safe globals. SAFE_TRANSFORMERS_GLOBAL can also be considered. Let me know if you see the need.
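For example, a user could extend the allowlist on top of whatever the library registers (a minimal sketch; MyCustomType is a placeholder, not anything from this PR):

import torch

class MyCustomType:
    pass

# Available since PyTorch 2.4; extends the process-wide allowlist, so
# later torch.load(..., weights_only=True) calls may unpickle MyCustomType.
torch.serialization.add_safe_globals([MyCustomType])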
Force-pushed fa62472 → 276a3a0
src/transformers/trainer.py
Outdated
Should I add any other numpy dtypes to the list? As of now I spotted only np.uint32 in the Transformers list as the one needed.
The only one I don't see from accelerate is encode; however, if things pass here without it, it's accelerate-specific and we don't need to worry about it.
Transformers tests did pass on my side without adding encode. This indeed seems accelerate-specific.
Force-pushed 276a3a0 → 4273a30
muellerzr
left a comment
Thanks! Just a documentation suggestion but this all looks correct
Force-pushed 4273a30 → 468aa06
@muellerzr: done, added a link to the Accelerate PR.
SunMarc
left a comment
LGTM! Just a nit.
Force-pushed 468aa06 → 27c307f
@SunMarc: addressed, reused the approach from accelerate on …
Force-pushed 27c307f → dbb3112
src/transformers/trainer.py
Outdated
Just a nit: should it be "2.6.0" here, or is it really necessary for it to be "2.4.0"?
Switched to version < 2.6.0a0. Indeed, on switching to the context manager I overlooked that it was introduced later. Overall:
- torch.serialization.add_safe_globals appeared in PyTorch 2.4
- torch.serialization.safe_globals (the context manager) appeared in 2.5
- PyTorch 2.6 flipped the default of weights_only in torch.load from False to True

So it indeed does not make sense to have this code working for versions earlier than 2.6 unless we start calling torch.load with an explicit weights_only=True. See the sketch below.
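A minimal sketch of that context-manager form, assuming PyTorch >= 2.5; the file name and the globals list are illustrative:

import numpy as np
import torch

# The allowlist applies only inside the with-block, unlike add_safe_globals.
with torch.serialization.safe_globals([np.core.multiarray._reconstruct, np.ndarray, np.dtype]):
    checkpoint = torch.load("rng_state.pth")  # weights_only defaults to True on 2.6+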
Hi! A tiny question: how do I get 2.6.0a0 installed? I know how to install the nightly, but it gets dev202411xx instead of a0.
Anyway, good to use a0 here for now. Once 2.6 is released, we can change it to 2.6.
Hi! A tiny question: how to get 2.6.0a0 installed.
I am getting this building from sources, and <2.6.0 does not work for me on my build, so 2.6.0a0 is my best effort to get the check working for my current build. I did not know that nightly builds get dev202411xx; I thought they also gave a0. I wonder whether the check will still work for the nightly?
I checked: <2.6.0a0 won't work with the nightly. So I switched to a check I once spotted in code by Narsil. This should handle both cases, building from sources and using the 2.6 nightly (I checked; it works for both on my side):
if version.parse(torch.__version__).release < version.parse("2.6").release:
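A worked illustration of why the .release comparison covers both builds (the version strings below are examples):

from packaging import version

nightly = version.parse("2.6.0.dev20241112")   # nightly build
source = version.parse("2.6.0a0+git1234abc")   # built from sources

print(nightly < version.parse("2.6.0a0"))               # True: dev sorts before a0, so this check misfires on nightly
print(nightly.release < version.parse("2.6").release)   # False: (2, 6, 0) is not < (2, 6)
print(source.release < version.parse("2.6").release)    # False: both builds are treated as 2.6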
Force-pushed dbb3112 → 0505f2c
ydshieh
left a comment
Thanks
Force-pushed 0505f2c → 820ca4a
Thanks for fixing 🤗
torch restricted the unpickler to work with torch.Tensors and a few primitive types (https://pytorch.org/docs/stable/notes/serialization.html#weights-only). To work with other types, one can either set weights_only=False (which is unsafe) or add an allowlist of additional types that may be unpickled. See also the discussion in huggingface/transformers#34632, from which the solution was adopted.
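A hedged sketch of those two options (CustomType and the file name are placeholders):

import torch

class CustomType:
    pass

# Option 1: disable the check entirely (unsafe for untrusted checkpoints).
state = torch.load("checkpoint.pth", weights_only=False)

# Option 2: keep weights_only=True and allowlist only the types the file needs
# (the safe_globals context manager requires PyTorch >= 2.5).
with torch.serialization.safe_globals([CustomType]):
    state = torch.load("checkpoint.pth", weights_only=True)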
Starting from version 2.4, PyTorch introduces a stricter check for the objects which can be loaded with torch.load(). Starting from version 2.6, loading with weights_only=True requires allowlisting of such objects.

This commit adds an allowlist of some numpy objects used to load model checkpoints. Usage is restricted by a context manager. Users can still call torch.serialization.add_safe_globals() to add other objects to the safe globals list.

The Accelerate library also ran into the same problem and addressed it with PR 3036.
Fixes: #34631
See: pytorch/pytorch#137602
See: https://pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals
See: huggingface/accelerate#3036
CC: @muellerzr @SunMarc