Conversation


@wconstab wconstab commented Oct 24, 2024

Stack from ghstack (oldest at bottom):

The test added in the previous PR runs a simple schedule on one rank with one
pp stage. The schedule has 2 microbatches and runs the F (forward), I
(backward-input), and W (backward-weight) ops separately for each microbatch.
The test then compares .grad for the pipelined module against a reference module.
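The shape of the test can be sketched in pure Python (no torch; all class and method names here are hypothetical, chosen only to illustrate the F/I/W split and the reference comparison):

```python
# Pure-Python sketch of the schedule under test: F, then I (input grad),
# then W (weight grad) run as separate phases per microbatch on a single
# stage, with the accumulated weight grad compared against a reference
# that computes the fused full backward. All names are hypothetical.

class ScalarLinearStage:
    """Single pipeline stage computing y = w * x for scalar inputs."""
    def __init__(self, w):
        self.w = w
        self.grad = None          # accumulated dL/dw, like param.grad
        self._saved = []          # (x, upstream_grad) per microbatch

    def forward(self, x):                     # F step
        self._saved.append([x, None])
        return self.w * x

    def backward_input(self, mb, grad_out):   # I step: dL/dx = w * dL/dy
        self._saved[mb][1] = grad_out         # stash for the W step
        return self.w * grad_out

    def backward_weight(self, mb):            # W step: dL/dw = x * dL/dy
        x, grad_out = self._saved[mb]
        self.grad = (self.grad or 0.0) + x * grad_out

# Run the split schedule on 2 microbatches with loss L = y (so dL/dy = 1).
stage = ScalarLinearStage(w=3.0)
inputs = [1.0, 2.0]
for x in inputs:
    stage.forward(x)
for mb in range(2):
    stage.backward_input(mb, grad_out=1.0)
for mb in range(2):
    stage.backward_weight(mb)

# Reference: a fused backward accumulates dL/dw = sum(x_i) directly.
reference_grad = sum(inputs)
assert stage.grad == reference_grad   # 3.0
```

The failure described in this PR corresponds to the W step never writing into `grad`, so the pipelined side stays `None` while the reference side is populated.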

The test fails because the pipelined module has 'None' grads. The debug
logs show that the executor is running the I and W steps, and additional
print statements show that the underlying backward_ functions are being
called as expected.

I identified that the 'param_groups' saved during backward_input were an
empty list, and I think that is why no gradients are computed during
backward. This contradicts one comment in the code. There is a special
case for stage 0 to execute 'full' backward instead of 'weight' backward
when running 'backward_weight_one_chunk'. I tried running both the if and
the else branch of this logic, with the same result: 'None' grads. The
special case seems to fit my case: I only have 1 stage, so it is stage 0.
But running 'full' backward for dW does not seem to help.
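The empty 'param_groups' observation is consistent with the symptom. A minimal sketch (pure Python, hypothetical names, not the actual PyTorch internals) of why an empty saved param list leaves grads as None:

```python
# Hypothetical sketch of the failure mode: if the param_groups list saved
# during backward_input comes back empty, the weight-grad step has nothing
# to iterate over, so every param.grad is never written and stays None.

class Param:
    def __init__(self):
        self.grad = None

def backward_weight(param_groups, computed_grads):
    # Writes grads only for params listed in param_groups.
    for p, g in zip(param_groups, computed_grads):
        p.grad = g if p.grad is None else p.grad + g

p = Param()
backward_weight([], [])        # empty param_groups, as observed in the test
assert p.grad is None          # grads never populated -> .grad mismatch

backward_weight([p], [1.5])    # with a non-empty group, grads appear
assert p.grad == 1.5
```

Under this reading, whether the if or the else branch of the stage-0 special case runs does not matter: the W step silently no-ops either way once the saved list is empty.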

repro:
`TORCH_LOGS=+pp python test/distributed/pipelining/test_schedule.py -k test_grad_with_split_b_w`

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138863

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures

As of commit 5d5f6b5 (merge base could not be retrieved; please contact dev infra):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Contributor

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "topic: not user facing"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]
@wconstab wconstab changed the title from "Debug test failure for separate I, W execution" to "[not for merge] Debug test failure for separate I, W execution" on Oct 25, 2024
@wconstab wconstab changed the title from "[not for merge] Debug test failure for separate I, W execution" to "Debug test failure for separate I, W execution" on Oct 25, 2024
H-Huang added a commit that referenced this pull request Oct 28, 2024
Ran into issues (#138863) when adding a Schedule with a single stage for zero bubble; adding code to support this, mostly for test purposes

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Oct 28, 2024
Ran into issues (#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Oct 30, 2024
Ran into issues (#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: #138925
Approved by: https://github.com/wconstab
@wconstab wconstab closed this Oct 31, 2024
@wconstab wconstab deleted the gh/wconstab/350/head branch October 31, 2024 16:15
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Ran into issues (pytorch#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: pytorch#138925
Approved by: https://github.com/wconstab
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024
ghstack-source-id: ef0fc80
Pull Request resolved: pytorch/pytorch#138863

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue
