Skip to content

Conversation

@c-p-i-o
Copy link
Contributor

@c-p-i-o c-p-i-o commented Dec 5, 2024

Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a std::exception - broken promise.

There are two changes in here.

  1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
    Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
  2. Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.

TODO:

  • there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
    This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Differential Revision: D66843194

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142178

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a5411f7 with merge base c6e18a1 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Dec 5, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

@c-p-i-o c-p-i-o self-assigned this Dec 5, 2024
@c-p-i-o c-p-i-o requested review from fduwjj and kwen2501 December 5, 2024 23:16
@kwen2501 kwen2501 changed the title [RFC][BE] Improve Flight Recorder efficacy [BE] Improve Flight Recorder efficacy Dec 5, 2024
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 5, 2024
c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Dec 5, 2024
Summary:

This is an attempt to improve the flight recorder efficacy. 
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2.  Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.


TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Differential Revision: D66843194
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

@c-p-i-o c-p-i-o requested a review from wconstab December 6, 2024 00:06
Copy link
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Have some minor comments.

// stack traces.
auto status = promiseFlightRecorderDump_.get_future().wait_for(
std::chrono::milliseconds(waitTimeoutDumpInMilSec_));
std::chrono::milliseconds(2 * waitTimeoutDumpInMilSec_));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: waitTimeoutDumpInMilSec_ is not used elsewhere, maybe we can directly bump its value by 2x? (easier maintenance I guess)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the watchdog needs to give 2 mins for heartbeat thread to try to dump (1 min for with stack trace and if this fails, 1 min without stack trace).

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

c-p-i-o added a commit to c-p-i-o/pytorch that referenced this pull request Dec 6, 2024
Summary:

This is an attempt to improve the flight recorder efficacy. 
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2.  Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.


TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Reviewed By: kwen2501

Differential Revision: D66843194
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to do a roll out on this or it is ok that we directly have it in the code?

@netlify
Copy link

netlify bot commented Dec 6, 2024

Deploy Preview for chimerical-cranachan-793287 ready!

Name Link
🔨 Latest commit a5411f7
🔍 Latest deploy log https://app.netlify.com/sites/chimerical-cranachan-793287/deploys/67539d88f7d8fe00080c3104
😎 Deploy Preview https://deploy-preview-142178--chimerical-cranachan-793287.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

Summary:

This is an attempt to improve the flight recorder efficacy. 
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2.  Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.


TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Reviewed By: kwen2501

Differential Revision: D66843194
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D66843194

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out FR buffer with stack traces. If this fails, we attempt to capture FR buffer again - but this time without stack traces. The assumption here is that FR could be locking up when unwinding stack.
Note, to keep things simple, I'm re-using the same file name for both with/without stack_trace.
2.  Add additional catch statements in the Manifold writer. There might be something going on in here - so we'll get a log statement if this is failing.

TODO:
- there's nothing differentiating in the output that says whether stack traces were omitted purposefully or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Differential Revision: D66843194

Pull Request resolved: pytorch#142178
Approved by: https://github.com/kwen2501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants