@sraikund16
Contributor

@sraikund16 sraikund16 commented Nov 12, 2024

Summary:
It seems like this issue is caused by leftover CUPTI events from warmup persisting in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our base time. The delta is then negative, which wraps around in unsigned arithmetic. The result is a very large number, which is later added to a signed value and causes the signed overflow.

Solution: If a raw timestamp is less than the base timestamp, set the processed timestamp to -1 so that these events can be marked as "to ignore". In Kineto, add a special case to ignore negative timestamps.
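The failure mode and the guard described above can be sketched as follows. This is a minimal illustration, not the code from the diff: the aliases `approx_time_t` and `time_ns_t`, the constant `kIgnoredTimestamp`, and both function names are hypothetical, and the raw counter is assumed to be an unsigned 64-bit integer.

```cpp
#include <cstdint>

// Hypothetical aliases for illustration; the profiler's real types may differ.
using approx_time_t = std::uint64_t;  // raw counter value (assumed unsigned)
using time_ns_t = std::int64_t;       // converted, signed timestamp

// Hypothetical sentinel: "event started before the base time, ignore it".
constexpr time_ns_t kIgnoredTimestamp = -1;

// Unsigned subtraction wraps modulo 2^64 instead of going negative.
constexpr std::uint64_t raw_delta(approx_time_t t, approx_time_t t0) {
  return t - t0;
}

// Guarded conversion: only subtract when the result cannot wrap.
constexpr time_ns_t convert_guarded(approx_time_t t, approx_time_t t0,
                                    double scale, time_ns_t base) {
  return t < t0
      ? kIgnoredTimestamp
      : static_cast<time_ns_t>(static_cast<double>(t - t0) * scale) + base;
}

// A stale event 10 ticks before the base time yields a delta of 2^64 - 10.
static_assert(raw_delta(90, 100) == UINT64_MAX - 9, "delta wraps around");
// The guard flags the stale event instead of converting the wrapped delta.
static_assert(convert_guarded(90, 100, 1.0, 1000) == kIgnoredTimestamp,
              "stale event is flagged");
// An event inside the window converts normally: (150 - 100) * 2.0 + 1000.
static_assert(convert_guarded(150, 100, 2.0, 1000) == 1100,
              "in-window event converts normally");
```

The sketch deliberately never converts the wrapped delta to a signed value: a double of roughly 1.8e19 is outside the range of a signed 64-bit integer, so that conversion has undefined behavior.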

Test Plan: Test with ASAN

Differential Revision: D65835650

cc @robieta @chaekit @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sanrise

@pytorch-bot

pytorch-bot bot commented Nov 12, 2024

✅ No Failures

As of commit 85c0a65 with merge base 3d2dd14:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65835650

@sraikund16 added the ciflow/trunk, topic: not user facing, oncall: profiler, and release notes: profiler labels and removed the oncall: profiler and release notes: profiler labels on Nov 12, 2024
Summary:

It seems like this issue is caused by leftover CUPTI events from warmup persisting in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our base time. The delta is then negative, which wraps around in unsigned arithmetic. The result is a very large number, which is later added to a signed value and causes the signed overflow.

Solution: If a raw timestamp is less than the base timestamp, set the processed timestamp to -1 so that these events can be marked as "to ignore". In Kineto, these events will be filtered out because they fall out of range.
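The out-of-range filtering can be pictured with a small sketch. Nothing here is Kineto's actual code: the `Event` struct, `in_window`, and `filter_events` are hypothetical names, and the profiling window is assumed to be a closed interval with a non-negative start, so that a timestamp of -1 always falls outside it.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical event record; Kineto's real activity types are not shown here.
struct Event {
  const char* name;
  std::int64_t start_ns;  // -1 marks an event that predates the base time
};

// True when the timestamp lies inside the profiling window [begin, end].
constexpr bool in_window(std::int64_t ts, std::int64_t begin,
                         std::int64_t end) {
  return ts >= begin && ts <= end;
}

// Drop every event whose start falls outside the window (erase-remove idiom).
inline std::vector<Event> filter_events(std::vector<Event> events,
                                        std::int64_t begin,
                                        std::int64_t end) {
  events.erase(std::remove_if(events.begin(), events.end(),
                              [=](const Event& e) {
                                return !in_window(e.start_ns, begin, end);
                              }),
               events.end());
  return events;
}

// With a non-negative window start, the -1 marker is always out of range.
static_assert(!in_window(-1, 1000, 2000), "marked event is filtered");
static_assert(in_window(1500, 1000, 2000), "in-range event is kept");
```

Under these assumptions no separate check for negative timestamps is needed, because the -1 marker already fails the ordinary range test.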

Test Plan: Tested with ASAN and got a reasonable profiler trace. Also ran a basic ResNet test, and only events in range show up.

Differential Revision: D65835650

  return [=](approx_time_t t_approx) {
    // See above for why this is more stable than `A * t_approx + B`.
-   return (time_t)((double)(t_approx - t0_approx) * scale_factor) + t0;
+   return t_approx > t0_approx
Contributor

You could probably do

(time_t)(((double) t_approx - t0_approx) * scale_factor) + t0

without really affecting the precision, since the original code already casts to double.

But if you prefer the current approach, that's probably fine too.
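The two forms discussed in this comment can be compared side by side. This is a rough sketch under assumptions: `approx_time_t` is taken to be an unsigned 64-bit integer, and `time_ns_t`, `delta_then_cast`, and `cast_then_delta` are hypothetical names that do not appear in the diff.

```cpp
#include <cstdint>

// Hypothetical aliases; the real profiler types may differ.
using approx_time_t = std::uint64_t;  // assumed unsigned raw counter
using time_ns_t = std::int64_t;

// Form from the original line: subtract as integers, then cast the delta.
// Only meaningful when t >= t0, because the subtraction is unsigned.
constexpr time_ns_t delta_then_cast(approx_time_t t, approx_time_t t0,
                                    double scale, time_ns_t base) {
  return static_cast<time_ns_t>(static_cast<double>(t - t0) * scale) + base;
}

// Suggested form: cast to double first, so the subtraction happens in
// floating point and can go negative without wrapping.
constexpr time_ns_t cast_then_delta(approx_time_t t, approx_time_t t0,
                                    double scale, time_ns_t base) {
  return static_cast<time_ns_t>(
             (static_cast<double>(t) - static_cast<double>(t0)) * scale) +
         base;
}

// Both forms agree for an event inside the window.
static_assert(delta_then_cast(150, 100, 2.0, 1000) == 1100, "integer delta");
static_assert(cast_then_delta(150, 100, 2.0, 1000) == 1100, "double delta");

// For a stale event, the cast-first form yields a time before the base.
static_assert(cast_then_delta(90, 100, 1.0, 1000) == 990, "no wraparound");

// Above 2^53 a double cannot represent every integer, so a one-tick
// difference is lost when the cast happens before the subtraction.
constexpr approx_time_t kBig = approx_time_t{1} << 53;
static_assert(delta_then_cast(kBig + 1, kBig, 1.0, 0) == 1, "exact delta");
static_assert(cast_then_delta(kBig + 1, kBig, 1.0, 0) != 1, "rounded delta");
```

In this sketch the cast-first form returns a timestamp earlier than the base rather than a -1 marker, so stale events would still have to be dropped by a range check downstream. Whether the rounding above 2^53 matters depends on how large the raw counter values get, which this thread does not state.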

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#140441
Approved by: https://github.com/davidberard98
Labels

ciflow/trunk, fb-exported, Merged, topic: not user facing
