-
Notifications
You must be signed in to change notification settings - Fork 26.3k
perf(inductor): improve Adam compile times drastically by moving list comprehensions into _init_group
#110596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(inductor): improve Adam compile times drastically by moving list comprehensions into _init_group
#110596
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110596
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 103fa72 with merge base cf1b494 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
janeyx99
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:o wow
This looks pretty good to me. Could you run the numbers a few times to have more statistically significant evidence?
@mlazos are there concerns with moving more things into _init_group?
Yes, I've run multiple times, sometimes inductor times change (it's quite variable) but dynamo times are quite invariant and improve by the same amount. |
janeyx99
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I've discovered a problem--people call our functional APIs (adam, single_tensor_adam, multi_tensor_adam) and expect complex tensors to work. In this PR, that would no longer work :/
It's great you identified these as slow bottlenecks though--we should work on getting dynamo to skip compiling this somehow.
Right, that makes sense. In this case, we can do the conversion in loop using the alternate strategy, which is also faster. |
|
Superceded by: #110607 |
Adam part of: #110506
TODO:
complexvia list comprehensions_init_group) is better than alternate (convert list comprehensions into for loop with single branch)_init_groupmethod seems more elegant, as it reduces duplicate code insingleandmulti_tensorcases. It's also faster.Results:
NUM_PARAMS=200, foreach=TrueNUM_PARAMS=200, foreach=FalseResults (With logs enabled):
Details
`NUM_PARAMS=200, foreach=True,TORCH_LOGS=+dynamo,schedule` - main: dynamo: 48s, inductor: 32s, total: 80s - this PR: dynamo: 20s, inductor: 32s, total: 52s (speedup: 28s)NUM_PARAMS=200, foreach=False,TORCH_LOGS=+dynamo,scheduleBenchmark script
Alternate Strategy
Use for loops rather than list comprehensions
CC: @janeyx99 @mlazos