feat(optimizer): `Adagrad` will use `device` when `capturable` - True always when compiling with dynamo #110339

jon-chuang · 2023-09-30T19:15:22Z

Partial fix: #107006

CC: @mlazos as issue creator

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

pytorch-bot · 2023-09-30T19:15:25Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110339


✅ No Failures

As of commit 359aed7 with merge base 428cbd7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/optim/adagrad.py

…uang/fix-adagrad-cpu-param

jon-chuang · 2023-10-03T13:47:50Z

CC @janeyx99: I added a `capturable` flag in this PR to keep CPU tensors out of the tracing path, rather than hardcoding a device tensor.

Questions:

Do I also need to run benchmarks in this case? This should not change eager-mode performance. Regarding: feat(inductor): Improve Adamax to be better fused by Inductor and enable it #110345 (comment)
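As a rough illustration of the idea (not the PR's actual code), the sketch below chooses where the scalar `step` counter lives based on a `capturable` flag. The helper name `make_step` is hypothetical, and the CPU placement for the non-capturable case is an assumption based on the `cuda:0` vs `cpu` device mismatch reported for Adagrad on main.

```python
import torch


def make_step(p: torch.Tensor, capturable: bool) -> torch.Tensor:
    """Hypothetical helper: create the scalar `step` counter for parameter `p`."""
    # Assumption: on the capturable path the counter follows the parameter's
    # device, so the traced graph contains no CPU tensors. Otherwise it stays
    # a CPU scalar, which leaves the eager-mode layout unchanged.
    device = p.device if capturable else torch.device("cpu")
    return torch.zeros((), dtype=torch.float, device=device)


# A CPU parameter keeps the sketch runnable anywhere; in practice `p` could
# live on "cuda:0".
p = torch.ones(3)
step = make_step(p, capturable=True)
eager_step = make_step(p, capturable=False)
```

Keeping the flag off by default would leave eager mode untouched, which is why the question about rerunning benchmarks arises.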

torch/optim/adagrad.pyi

jon-chuang · 2023-10-03T21:53:02Z

torch/optim/adagrad.py

        )
        super().__init__(params, defaults)

-        for group in self.param_groups:


Note to reviewer:

Thanks to this change (required for lazily sending "step" to device as needed), a previous error in test/nn/test_lazy_modules.py for lazy initialization is now avoided.

test/inductor/test_compiled_optimizers.py

jon-chuang · 2023-10-04T23:07:51Z

test/nn/test_lazy_modules.py

            module.register_parameter('test_param', UninitializedParameter())
            if optim_cls is torch.optim.SGD:
                optim = optim_cls(module.parameters(), lr=0.0)
-            elif optim_cls is torch.optim.Adagrad:


Note to reviewer:

As noted in the other review comment, because `step` is now lazily initialized, the previous error in test/nn/test_lazy_modules.py for lazy initialization is avoided.

@janeyx99

… against list comprehensions (e.g. complex conversion) (#110613)

Fully fixes: #110506

Depends: #110607

Potential merge conflicts:
- #110339
- #110345
- #110454

Related:
- #110606 (we can apply the improvements here orthogonally to the complex support)

### Results

Benchmark: 100 params.

Breakdowns (float32, dynamo):
```
Adagrad: this PR: 4.4s, main: 8.8s
Adam: this PR: 2.1s, main: 9.8s
AdamW: this PR: 2.5s, main: 8.2s
ASGD: this PR: 3.1s, main: 8.5s
RMSProp: this PR: 1.3s, main: 4.2s
RProp: this PR: 6.7s, main: 14.9s
```

Notes:
1. Adagrad is still slow due to `_get_value` list comprehension. Can be fixed in https://github.com/pytorch/pytorch/pull/110339/files by utilizing capturable path
2. Adamax is not actually compiled (it is currently disabled).
3. Inductor compile time is quite variable. We calculate dynamo by subtracting `call_user_compiler` from `compile_inner` timing.

<details>

This PR:
```
Adagrad (torch.float32): 28.47496461868286s
Adagrad (torch.complex64): 29.379547357559204s
Adam (torch.float32): 17.334211587905884s
Adam (torch.complex64): 29.637500524520874s
Adamax (torch.float32): 2.4749321937561035s
Adamax (torch.complex64): 3.1997995376586914s
AdamW (torch.float32): 18.06532859802246s
AdamW (torch.complex64): 28.25661015510559s
ASGD (torch.float32): 23.70255398750305s
ASGD (torch.complex64): 25.33756995201111s
RMSprop (torch.float32): 7.964028596878052s
RMSprop (torch.complex64): 12.909599781036377s
Rprop (torch.float32): 30.512362003326416s
Rprop (torch.complex64): 44.74405765533447s
```

Main
```
Adagrad (torch.float32): 26.919506072998047s
Adagrad (torch.complex64): 35.190622091293335s
Adam (torch.float32): 25.715000867843628s
Adam (torch.complex64): 24.17716670036316s
Adamax (torch.float32): 2.4404726028442383s
Adamax (torch.complex64): 3.3538928031921387s
AdamW (torch.float32): 25.2022807598114s
AdamW (torch.complex64): 28.915700912475586s
ASGD (torch.float32): 24.108731985092163s
ASGD (torch.complex64): 26.589075088500977s
RMSprop (torch.float32): 10.781344175338745s
RMSprop (torch.complex64): 15.136352777481079s
Rprop (torch.float32): 42.46482181549072s
Rprop (torch.complex64): 48.28277635574341s
```

Seems that it doesn't help the complex case by much (but that's not the majority case). torch.float32 is generally positive, when it does not show drastic improvement / regresses, it is due to inductor variance (by manually inspecting the logs).

</details>

### Benchmark Script
```python
import torch
import time
from torch.optim import Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop

OPTIMS = [Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop]
DTYPES = [torch.float, torch.cfloat]

NUM_PARAMS = 100
kwargs = { "lr": 0.01, "foreach": True }
summary = []

for optim_cls in OPTIMS:
    for dtype in DTYPES:
        torch._dynamo.reset()
        # torch._inductor.metrics.reset()
        input = torch.ones([10, 10], dtype=dtype, device="cuda:0")
        model = torch.nn.Sequential(
            *[torch.nn.Linear(10, 10, dtype=dtype, device="cuda:0") for _ in range(NUM_PARAMS)]
        )

        model(input).sum().abs().backward()
        opt_compiled = optim_cls(model.parameters(), **kwargs)
        compiled_step = torch.compile(opt_compiled.step)

        with torch.set_grad_enabled(False):
            start_time = time.time()
            compiled_step()
            summary.append(f"{optim_cls.__name__} ({dtype}): {time.time() - start_time}s")

        print(optim_cls, kwargs, dtype, torch._dynamo.utils.compile_times())

for s in summary:
    print(s)
```

CC: @janeyx99 @mlazos

Pull Request resolved: #110613
Approved by: https://github.com/janeyx99
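To make the float32 dynamo breakdown in the commit message above easier to compare, here is a small sketch that turns the quoted timings into speedup ratios. The numbers are copied from that breakdown; the variable names are illustrative only.

```python
# Dynamo-only timings in seconds as (this PR, main), copied from the
# "Breakdowns (float32, dynamo)" list in the commit message.
timings = {
    "Adagrad": (4.4, 8.8),
    "Adam": (2.1, 9.8),
    "AdamW": (2.5, 8.2),
    "ASGD": (3.1, 8.5),
    "RMSProp": (1.3, 4.2),
    "RProp": (6.7, 14.9),
}

# Speedup = time on main divided by time with the PR applied.
speedups = {name: main / pr for name, (pr, main) in timings.items()}

# Print from largest to smallest improvement.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f}x faster than main")
```

On these figures Adagrad shows the smallest ratio, consistent with note 1 about the remaining `_get_value` list comprehension.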

jon-chuang · 2023-10-07T10:04:39Z

Ping @janeyx99 @mlazos

With the other issues out of the way, I hope we're able to tackle the original investigation into improving fusion of optimizers!

janeyx99

Thanks for adding the capturable path. Adding capturable is a heftier change, as we should also support it for the single-tensor path and add CUDA graphs testing. Feel free to follow the example in #106615 for what is expected. It might make sense to open general capturable support in another PR and have this one be the compiler work built on top of it.

janeyx99 · 2023-10-09T12:20:13Z

torch/optim/adagrad.py

+            state = self.state[p]
+            if "step" not in state:
+                state["step"] = (
+                    torch.zeros((), dtype=torch.float, device=p.device)


Let's use `p.new_zeros` to maintain the subclass information of `p`.
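A minimal sketch of what the suggestion could look like, assuming a plain dict for the per-parameter state. The helper name `ensure_step` is hypothetical. `Tensor.new_zeros` inherits the device of `p`, and the dtype is pinned to `torch.float` to match the diff; whether subclass information actually carries over depends on the subclass's `__torch_function__` behavior, which this sketch does not exercise.

```python
import torch


def ensure_step(state: dict, p: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: lazily create state["step"] the first time it is needed."""
    if "step" not in state:
        # new_zeros takes the device from `p`; the dtype is overridden so the
        # counter stays float32 even when `p` has another dtype.
        state["step"] = p.new_zeros((), dtype=torch.float)
    return state["step"]


p = torch.ones(2, 2, dtype=torch.float64)
state = {}

step = ensure_step(state, p)
step += 1  # in-place update, as an optimizer step would do

# A second call reuses the existing tensor instead of resetting it.
same_step = ensure_step(state, p)
```

Because the counter is created on first use rather than in `__init__`, nothing touches the parameter's data at construction time.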

@jansel

…=fusion` (#110415)

Fixes #110393

Example logs (for adagrad on main). In this case, it clearly identifies device mismatch as a potential red flag, which is indeed the obstacle to adagrad's successful fusion. (see: #110339)

```
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (1/10): 18 nodes =====
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 13 possible fusions:
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf8'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf10'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf12'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf14'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf9'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf11'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf13'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf15'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf25'), SchedulerNode(name='buf33'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf43'), SchedulerNode(name='buf51'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf34'), SchedulerNode(name='buf42'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf16'), SchedulerNode(name='buf24'))
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (1/10): fused 18 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (2/10): 5 nodes =====
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 0 possible fusions:
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (2/10): fused 5 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== fusion complete (2 iterations) =====
```

CC @jansel @ngimel @mlazos @shunting314 @peterbell10 as code owners

Pull Request resolved: #110415
Approved by: https://github.com/mlazos
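When a fusion log like the one above runs long, tallying the `cannot fuse` reasons can make the dominant blocker easier to spot. The following is a small standard-library sketch that assumes the message format shown in the log excerpt; the helper name `tally_reasons` and the regular expression are illustrative and not part of Inductor.

```python
import re
from collections import Counter

# A few lines copied from the log above, shortened to keep the sketch small.
LOG_EXCERPT = """\
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 13 possible fusions:
"""

# Capture the text after "cannot fuse (<tag>): " up to the next "(" or the
# end of the line, so per-node details such as device names are dropped.
CANNOT_FUSE = re.compile(r"cannot fuse \([^)]*\): (?P<reason>[^(\n]+)")


def tally_reasons(log_text: str) -> Counter:
    """Hypothetical helper: count each distinct 'cannot fuse' reason."""
    reasons = Counter()
    for match in CANNOT_FUSE.finditer(log_text):
        reasons[match.group("reason").strip()] += 1
    return reasons


reasons = tally_reasons(LOG_EXCERPT)
for reason, count in reasons.most_common():
    print(f"{count:3d}  {reason}")
```

Run against the full log, the same tally would surface the repeated `device mismatch` entries that the commit message singles out.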

github-actions · 2023-12-08T12:41:00Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

jon-chuang requested review from albanD and janeyx99 as code owners September 30, 2023 19:15

pytorch-bot bot added the release notes: optim label Sep 30, 2023

github-actions bot added the module: inductor label Sep 30, 2023

pytorchbot added the open source label Sep 30, 2023

fix

3327015

jon-chuang mentioned this pull request Oct 1, 2023

feat(inductor): Improve Adamax to be better fused by Inductor and enable it #110345

Closed

jon-chuang commented Oct 1, 2023

torch/optim/adagrad.py (outdated)

jon-chuang changed the title ~~fix(optimizer): Adagrad state param does not follow inputs' device~~ fix(optimizer): Adagrad state param does not follow inputs' device Oct 1, 2023

jon-chuang mentioned this pull request Oct 3, 2023

[inductor]: Better debugging of can_fuse decisions with TORCH_LOGS=fusion #110415

Closed

Merge branch 'main' of https://github.com/pytorch/pytorch into jon-ch…

b25e2f2

…uang/fix-adagrad-cpu-param

jon-chuang force-pushed the jon-chuang/fix-adagrad-cpu-param branch from 52123c3 to b25e2f2 Compare October 3, 2023 13:09

capturable flag to use device only when compile

2d451c6

jon-chuang changed the title ~~fix(optimizer): Adagrad state param does not follow inputs' device~~ feat(optimizer): Adagrad will use device only when capturable - True always when compiling with dynamo Oct 3, 2023

jon-chuang changed the title ~~feat(optimizer): Adagrad will use device only when capturable - True always when compiling with dynamo~~ feat(optimizer): Adagrad will use device when capturable - True always when compiling with dynamo Oct 3, 2023

add default capturable

d55498b

jon-chuang mentioned this pull request Oct 3, 2023

feat(inductor): Add RAdam to Inductor by converting data-dependent control-flow to torch.where #110351

Closed

remove test exclude

a8a227b

jon-chuang commented Oct 3, 2023

torch/optim/adagrad.pyi

jon-chuang commented Oct 3, 2023

colesbury added the triaged label Oct 3, 2023

jon-chuang commented Oct 4, 2023

test/inductor/test_compiled_optimizers.py

add more tests

359aed7

jon-chuang commented Oct 4, 2023


This was referenced Oct 5, 2023

perf(inductor): move optim complex type conversions into _init_group #110604

Closed

perf(inductor): use for loop with shortcut in Optimizers to speedup against list comprehensions (e.g. complex conversion) #110613

Closed

janeyx99 reviewed Oct 9, 2023


github-actions bot added the Stale label Dec 8, 2023

github-actions bot closed this Jan 7, 2024
