Conversation


@wconstab wconstab commented Oct 24, 2024

Stack from ghstack (oldest at bottom):

The test added in the previous PR runs a simple schedule on one rank with one
pp stage. The schedule has 2 microbatches and runs the F (forward), I
(backward-input), and W (backward-weight) ops separately for each microbatch.
The test then compares .grad for the pipelined module against a reference module.
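The shape of the test can be sketched in pure Python (no torch; all class and method names here are hypothetical, chosen only to illustrate the F/I/W split and the reference comparison):

```python
# Pure-Python sketch of the schedule under test: F, then I (input grad),
# then W (weight grad) run as separate phases per microbatch on a single
# stage, with the accumulated weight grad compared against a reference
# that computes the fused full backward. All names are hypothetical.

class ScalarLinearStage:
    """Single pipeline stage computing y = w * x for scalar inputs."""
    def __init__(self, w):
        self.w = w
        self.grad = None          # accumulated dL/dw, like param.grad
        self._saved = []          # (x, upstream_grad) per microbatch

    def forward(self, x):                     # F step
        self._saved.append([x, None])
        return self.w * x

    def backward_input(self, mb, grad_out):   # I step: dL/dx = w * dL/dy
        self._saved[mb][1] = grad_out         # stash for the W step
        return self.w * grad_out

    def backward_weight(self, mb):            # W step: dL/dw = x * dL/dy
        x, grad_out = self._saved[mb]
        self.grad = (self.grad or 0.0) + x * grad_out

# Run the split schedule on 2 microbatches with loss L = y (so dL/dy = 1).
stage = ScalarLinearStage(w=3.0)
inputs = [1.0, 2.0]
for x in inputs:
    stage.forward(x)
for mb in range(2):
    stage.backward_input(mb, grad_out=1.0)
for mb in range(2):
    stage.backward_weight(mb)

# Reference: a fused backward accumulates dL/dw = sum(x_i) directly.
reference_grad = sum(inputs)
assert stage.grad == reference_grad   # 3.0
```

The failure described in this PR corresponds to the W step never writing into `grad`, so the pipelined side stays `None` while the reference side is populated.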

The test fails because the pipelined module has 'None' grads. The debug
logs show that the executor is running the I and W steps, and additional
print statements show that the underlying backward_ functions are being
called as expected.

I identified that the 'param_groups' saved during backward_input were an
empty list, and I think that is why no gradients are computed during
backward. This contradicts one comment in the code. There is a special
case for stage 0 to execute 'full' backward instead of 'weight' backward
when running 'backward_weight_one_chunk'. I tried running both the if and
the else branch of this logic, with the same result: 'None' grads. The
special case seems to fit my case: I only have 1 stage, so it is stage 0.
But running 'full' backward for dW does not seem to help.
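The empty 'param_groups' observation is consistent with the symptom. A minimal sketch (pure Python, hypothetical names, not the actual PyTorch internals) of why an empty saved param list leaves grads as None:

```python
# Hypothetical sketch of the failure mode: if the param_groups list saved
# during backward_input comes back empty, the weight-grad step has nothing
# to iterate over, so every param.grad is never written and stays None.

class Param:
    def __init__(self):
        self.grad = None

def backward_weight(param_groups, computed_grads):
    # Writes grads only for params listed in param_groups.
    for p, g in zip(param_groups, computed_grads):
        p.grad = g if p.grad is None else p.grad + g

p = Param()
backward_weight([], [])        # empty param_groups, as observed in the test
assert p.grad is None          # grads never populated -> .grad mismatch

backward_weight([p], [1.5])    # with a non-empty group, grads appear
assert p.grad == 1.5
```

Under this reading, whether the if or the else branch of the stage-0 special case runs does not matter: the W step silently no-ops either way once the saved list is empty.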

repro:
`TORCH_LOGS=+pp python test/distributed/pipelining/test_schedule.py -k test_grad_with_split_b_w`

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138863

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures

As of commit 5d5f6b5 (merge base could not be retrieved; please contact dev infra):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Contributor

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "topic: not user facing"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]
@wconstab wconstab changed the title from "Debug test failure for separate I, W execution" to "[not for merge] Debug test failure for separate I, W execution" on Oct 25, 2024
@wconstab wconstab changed the title from "[not for merge] Debug test failure for separate I, W execution" to "Debug test failure for separate I, W execution" on Oct 25, 2024
H-Huang added a commit that referenced this pull request Oct 28, 2024
Ran into issues (#138863) when adding a Schedule with a single stage for zero bubble; adding code to support this, mostly for test purposes

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Oct 28, 2024
Ran into issues (#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Oct 30, 2024
Ran into issues (#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: #138925
Approved by: https://github.com/wconstab
@wconstab wconstab closed this Oct 31, 2024
@wconstab wconstab deleted the gh/wconstab/350/head branch October 31, 2024 16:15
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Ran into issues (pytorch#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: pytorch#138925
Approved by: https://github.com/wconstab
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024
ghstack-source-id: ef0fc80
Pull Request resolved: pytorch/pytorch#138863

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue
