Update placement utils and weights to handle meta device #162842

georgiaphillips · 2025-09-12T19:55:55Z

Summary:
This diff fixes two issues that come up when testing a tgif-published pt2 model remote net:

First, update isSameDevice to handle the meta device, which avoids this error:

what():  Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
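A minimal sketch of the meta-device case, for illustration only. It does not reproduce the code in PlacementUtils.cpp: `Device`, `DeviceType`, and `isSameDeviceSketch` are hypothetical stand-ins (not `c10::Device`), and treating a missing CUDA index as device 0 is an assumption of this sketch.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a device descriptor; NOT c10::Device.
enum class DeviceType { CPU, CUDA, Meta, Other };

struct Device {
  DeviceType type;
  int index;  // -1 means "no explicit index"
  explicit Device(DeviceType t, int i = -1) : type(t), index(i) {}
};

// Assumption: a CUDA device without an explicit index means device 0.
inline int normalizedIndex(const Device& d) {
  return d.index < 0 ? 0 : d.index;
}

// Returns true when both descriptors refer to the same placement.
inline bool isSameDeviceSketch(const Device& a, const Device& b) {
  switch (a.type) {
    case DeviceType::CPU:
      return b.type == DeviceType::CPU;
    case DeviceType::CUDA:
      return b.type == DeviceType::CUDA &&
          normalizedIndex(a) == normalizedIndex(b);
    case DeviceType::Meta:
      // The added case: meta compares equal to meta instead of
      // falling through to the "unsupported" error below.
      return b.type == DeviceType::Meta;
    default:
      throw std::runtime_error("Unsupported device type");
  }
}
```

With a case like this, comparing two meta devices returns true rather than raising, while meta versus cpu still compares as different.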

Second, update the xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, when a weight is replaced with ModelRunnerAdapter.setAttr(), we enforce that the old weight and the new weight are on the same device. However, non-TBE xl weights are replaced by finding every weight on the "meta" device and swapping in the correct weight, on a real device, from the xl_weights folder. The old and new weights therefore always have different devices, so the device check is invalid. I don't think we've run into this so far because non-TBE xl weights had not been thoroughly tested until now.
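The replacement flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in Weights.cpp: `WeightInfo`, `validateDeviceSketch`, `replaceMetaWeightsSketch`, and the `skipDeviceCheck` flag are hypothetical names, and devices are modelled as plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical weight record; only the device matters for this sketch.
struct WeightInfo {
  std::string device;  // e.g. "meta", "cpu", "cuda:0"
};

using WeightMap = std::unordered_map<std::string, WeightInfo>;

// Throws when the devices differ, unless the caller opts out of the check.
inline void validateDeviceSketch(
    const std::string& name,
    const WeightInfo& oldWeight,
    const WeightInfo& newWeight,
    bool skipDeviceCheck) {
  if (skipDeviceCheck) {
    return;  // assumption: the meta-replacement path skips the device check
  }
  if (oldWeight.device != newWeight.device) {
    throw std::runtime_error(
        "Mismatched device for " + name + ": " + oldWeight.device + " vs " +
        newWeight.device);
  }
}

// Replaces every weight on "meta" with its real counterpart, if one exists.
// Returns the number of weights replaced.
inline std::size_t replaceMetaWeightsSketch(
    WeightMap& weights,
    const WeightMap& realWeights) {
  std::size_t replaced = 0;
  for (auto& entry : weights) {
    if (entry.second.device != "meta") {
      continue;  // only meta placeholders are replaced
    }
    auto it = realWeights.find(entry.first);
    if (it == realWeights.end()) {
      continue;  // no real weight available; leave the placeholder as-is
    }
    // Old is "meta" and new is a real device, so they always differ here.
    validateDeviceSketch(entry.first, entry.second, it->second, true);
    entry.second = it->second;
    ++replaced;
  }
  return replaced;
}
```

In this sketch the ordinary replacement path would still pass `false` and keep the strict check; only the meta-placeholder path opts out.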

Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:

Unsupported device typemeta and meta

Then after change #1 and before change #2 we get:

what():  Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374

After both changes, the run is successful.
Command:

MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0"  --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt

Rollback Plan:

Differential Revision: D80713052

pytorch-bot · 2025-09-12T19:55:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162842

✅ No Failures

As of commit 5ff4752 with merge base 38afeb2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-09-12T19:56:07Z

@georgiaphillips has exported this pull request. If you are a Meta employee, you can view the originating diff in D80713052.

georgiaphillips · 2025-09-12T19:56:50Z

@pytorchbot label "topic: not user facing"

facebook-github-bot · 2025-09-12T19:58:41Z

@georgiaphillips has exported this pull request. If you are a Meta employee, you can view the originating diff in D80713052.

Summary:
Add cpp pytree node registration for KJTs. The Python registration already exists in https://www.internalfb.com/code/fbsource/[867a1c6bad82]/fbcode/torchrec/sparse/jagged_tensor.py?lines=3087

This is needed for running nets with KJTs in sigmoid runtime.

Test Plan:
Run load_net_predictor. Before the change, we got the error `Unknown pytree node type: torchrec.sparse.jagged_tensor.KeyedJaggedTensor`. After the change, it runs successfully.

```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1260
module_name=local
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/1260/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Unit test:

```
fbcode//caffe2/test/cpp/nativert:itree_test -- ITreeTest.JaggedTensorNodeRegistration --run-disabled
```

Rollback Plan:

Differential Revision: D80656182

Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:

1. Updates isSameDevice to handle meta device to avoid error "Unsupported device typemeta and meta"
2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid when caller upstream is replaceMetaTensors. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now.

Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:

```
Unsupported device typemeta and meta
```

Then after change #1 and before change #2 we get:

```
what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```

After change run is successful

Command:

```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Rollback Plan:

Differential Revision: D80713052

facebook-github-bot · 2025-09-12T20:01:03Z

@georgiaphillips has exported this pull request. If you are a Meta employee, you can view the originating diff in D80713052.

facebook-github-bot · 2025-09-17T08:04:49Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-09-17T08:06:53Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Pull Request resolved: pytorch#162842
Approved by: https://github.com/henryoier

facebook-github-bot added fb-exported meta-exported labels Sep 12, 2025

pytorch-bot bot added the topic: not user facing topic category label Sep 12, 2025

georgiaphillips changed the title ~~Update placement utils and weights to handle meta~~ Update placement utils and weights to handle meta device Sep 12, 2025

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 12, 2025

georgiaphillips force-pushed the export-D80713052 branch from 60af766 to 314ff99 Compare September 12, 2025 19:58

georgiaphillips added 2 commits September 12, 2025 13:00

georgiaphillips force-pushed the export-D80713052 branch from 314ff99 to 5ff4752 Compare September 12, 2025 20:00

henryoier approved these changes Sep 15, 2025

pytorchmergebot added the merging label Sep 17, 2025

pytorchmergebot closed this in b229455 Sep 17, 2025

pytorchmergebot added Merged and removed merging labels Sep 17, 2025
