feat(inductor): Improve compilation speed and add SGD Optimizer back to Inductor
#110353
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110353
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures
As of commit ebe7daf with merge base cf1b494, the following new jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Hmm, dlrm fails with:

Details

Seems like accessing sparse tensor storage generally has poor support: #108667, #106837. The issue does seem to be cropping up more frequently lately (3 in the last 2 months, including this PR).
@jon-chuang typically optimizers operate on hundreds of parameters in our benchmarks. I've been benchmarking them on models with ~1k parameters to provide good data (this number is more representative of what we have in our OSS benchmarks). It currently takes about 2 minutes to compile an optimizer with 1k parameters. This is on par with Jax, but there are a few improvements in flight which should improve this further.
Would you be able to share the benchmark results? I would like to investigate what might be causing the slow compile times. Even a simple repro script would be really great (I could also try to come up with a repro).

I guess this is slower in the single (non-foreach) case?
Hmm, I investigated locally with the following script:

```python
import time

import torch
from torch.optim import Adam, SGD


def compile_opt(opt_compiled):
    torch._dynamo.eval_frame.TorchPatcher.patch()
    step_fn = opt_compiled.step.__wrapped__

    def fn():
        step_fn(opt_compiled)

    return torch.compile(fn, backend="inductor", fullgraph=True)


optim_cls = SGD
NUM_PARAMS = 1000
kwargs = {"lr": 0.01, "foreach": True}

torch._dynamo.reset()
# torch._inductor.metrics.reset()

model = torch.nn.Sequential(
    *[torch.nn.Linear(10, 10, device="cuda:0") for _ in range(NUM_PARAMS)]
)
input = torch.ones([10, 10], device="cuda:0")
model(input).sum().backward()

opt_compiled = optim_cls(model.parameters(), **kwargs)
compiled_step = compile_opt(opt_compiled)

with torch.set_grad_enabled(False):
    start_time = time.time()
    compiled_step()
    print(f"compile opt took: {time.time() - start_time} seconds")
```

Results

NUM_PARAMS = 200

NUM_PARAMS = 1000

Scaling hypothesis

Scaling hypothesis for SGD: dynamo goes from 7s to 162s, roughly a 24x increase ≈ 5², for a 5x increase in parameter count (200 -> 1000). So we expect the dynamo cost to scale quadratically with num_params. This is likely due to the expensive for loop in calling "step". This is the loop in question: Line 39 in 31d6358

Conclusion

So if we talk about compile times, Adam is actually much slower. It seems the bottleneck is dynamo.
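For readers following along, here is a minimal sketch of the kind of per-parameter Python loop dynamo has to trace in an optimizer step; this is an illustrative simplification, not the actual code at the line referenced above:

```python
import torch

# Illustrative only: a stripped-down single-tensor SGD update loop.
# Dynamo traces the body of the Python `for` loop once per parameter, so the
# tracing work (and the size of the captured graph) grows with the number of
# parameters.
def sgd_step_single_tensor(params, grads, lr):
    for param, grad in zip(params, grads):
        param.add_(grad, alpha=-lr)  # one in-place update per parameter


params = [torch.ones(10) for _ in range(1000)]
grads = [torch.ones(10) for _ in range(1000)]
compiled = torch.compile(lambda: sgd_step_single_tensor(params, grads, 0.01))
compiled()  # compile time here scales with len(params)
```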
@mlazos, I am unable to reproduce your results. Of the optimizers, SGD compiles the fastest. I applied more optimizations in this branch to make it faster.
Which case are you seeing in the benchmarks? (foreach=True or False?)
```diff
 grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([params, grads, momentum_buffer_list], with_indices=True)
 for ((device_params, device_grads, device_momentum_buffer_list), indices) in grouped_tensors.values():
-    device_has_sparse_grad = any(grad.is_sparse for grad in device_grads)
+    device_has_sparse_grad = has_sparse_grad and any(grad.is_sparse for grad in device_grads)
```
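The change above is essentially a short-circuit: when the caller already knows there are no sparse gradients, the per-tensor `is_sparse` scan never runs, which also means less work for dynamo to trace. A small self-contained illustration (the tensors here are hypothetical stand-ins, not the optimizer's real inputs):

```python
import torch

device_grads = [torch.randn(10) for _ in range(1000)]  # dense grads, the common case
has_sparse_grad = False  # already known by the caller

# Old behaviour: unconditionally scans every gradient's is_sparse flag.
old = any(grad.is_sparse for grad in device_grads)

# New behaviour: `and` short-circuits, so the scan is skipped entirely
# whenever has_sparse_grad is already False.
new = has_sparse_grad and any(grad.is_sparse for grad in device_grads)

print(old, new)  # False False
```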
If you move this out to a separate PR, I can approve that as the discussion to re-enable continues here.
Separately, I would love to get optimizer compile time back into the PT2 benchmarks, so I am in support of getting SGD compile time down at the very least. If general optimizers take too long everywhere, the other option I was thinking about is to try all compilable optimizers for just one model.
Standalone: #110648
```python
disabled_multi_tensor_opt_modules = {
    adamax,
    radam,  # data-dependent control flow
    sgd,  # for now, until we can speed up compilation (this affects the benchmarks)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to understand the impact in runtime for benchmarks with this line of change. To my knowledge, a few hours for all the benchmarks is acceptable.
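For context, a hypothetical sketch of how a disable set like this could be consumed; the real consumption site lives inside dynamo and may differ, so treat `maybe_compile_step` below purely as illustration:

```python
import torch
from torch.optim import adamax, radam, sgd  # the optimizer *modules*, not the classes

disabled_multi_tensor_opt_modules = {adamax, radam, sgd}
_disabled_module_names = {m.__name__ for m in disabled_multi_tensor_opt_modules}


def maybe_compile_step(optimizer, step_fn):
    # Hypothetical helper: fall back to the eager step for optimizers whose
    # multi-tensor path is on the disable list; compile everything else.
    if type(optimizer).__module__ in _disabled_module_names:
        return step_fn
    return torch.compile(step_fn)
```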
(#110648) Originated from: #110353 (comment). Speeds up significantly in the non-sparse path (the majority use case). Benchmarks: #110506 (comment). CC: @janeyx99. Pull Request resolved: #110648. Approved by: https://github.com/janeyx99
I manually kicked off some benchmarks to see the difference it'd make in #111341 and got these results on the torchinductor HUD page. I'm not entirely well-versed in interpreting these results yet, and the manual run only ran a subset of configs, but it looks like the speedups do not seem worth it for SGD given the increase in compile time. This is not news: vanilla SGD is just one instruction, so fusing doesn't do much.

Thus, if the e2e benchmarks on this page are meant to show the "most performant" feature combination, it may not be worth adding compilation to the benchmarks, at least not for SGD. We can revisit adding a compiled optimizer to the benchmarks later.

In the meantime, something I do think would be quite useful is to pick a subset of models (we could just start with resnet50) and then run all our supported flavors of compiled optimizers. We would want to measure E2E compilation + run time, with breakdowns of each (e.g., dynamo and inductor for compile time). I see at least two ways this could be done:
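Whichever of those routes is taken, the measurement itself could look roughly like the sketch below (wall-clock only; resnet50 via torchvision is an assumption, and the compile pattern reuses the repro script from earlier in this thread). For a dynamo/inductor breakdown, something like `torch._dynamo.utils.compile_times()` could supplement the wall-clock numbers, if it is available in the build:

```python
import time

import torch
import torchvision  # assumption: resnet50 sourced from torchvision

model = torchvision.models.resnet50().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")
model(x).sum().backward()  # populate .grad for every parameter

opt = torch.optim.Adam(model.parameters(), foreach=True)

# Same compilation pattern as the repro script above.
torch._dynamo.eval_frame.TorchPatcher.patch()
step_fn = opt.step.__wrapped__
compiled_step = torch.compile(lambda: step_fn(opt), backend="inductor", fullgraph=True)

with torch.no_grad():
    t0 = time.time()
    compiled_step()  # first call pays the dynamo + inductor compile cost
    torch.cuda.synchronize()
    print(f"compile + first step: {time.time() - t0:.1f}s")

    t0 = time.time()
    for _ in range(10):
        compiled_step()  # steady-state optimizer step time
    torch.cuda.synchronize()
    print(f"10 steady-state steps: {time.time() - t0:.3f}s")
```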
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.

Originally disabled in #105438
However, compilation times are plenty speedy to me:

It could have benefited from perf changes to the scheduler, like cycle-detection optimizations.
CC @mlazos as original code author
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler