-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Description
🐛 Describe the bug
VariableTracker is supposed to be immutable, so we can do checkpointing and rollback when instruction translation fails and we need to create a graph break.
However, we (indirectly) store fake tensor in VariableTracker, and fake tensor is not treated immutably. In particular, when we run nodes on fake tensor, those nodes may modify the metadata of the input fake tensor, and I don't see any logic that appropriately duplicates fake tensors so that we can safely run modifications on them.
The hypothetical bug is that we checkpoint, we then run the symbolic evaluator on squeeze_, and that mutates the fake tensor, but then we discover we need to rollback, and now subsequent code will see incorrect metadata for the input to squeeze. Discussing this with jansel, this is unlikely to happen in practice today, because the only time we rollback is right before we are generating a graph break, and at that point we're not going to actually insert any more nodes into the graph. But it will be good to get this correct, in case we end up using checkpointing more seriously in the future. More generally, we probably need some sort of lint to make sure that VariableTrackers do not actually get mutated, because they're not going to turn into real bugs immediately if you corrupt the state.
cc @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @eellison @jansel
Error logs
No response
Minified repro
No response