[POC] AOTInductor as Inductor backend #141700
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141700
Note: Links to docs will display an error until the docs builds have been completed.
❌ 53 New Failures, 2 Unrelated Failures
As of commit d058fc7 with merge base 0f261e8:
NEW FAILURES - The following jobs have failed:
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Signed-off-by: Edward Z. Yang <[email protected]>
ghstack-source-id: e88a8d6
Pull Request resolved: #141700
This PR needs a
```python
if graph is None:
    return None, cache_info

# See _save_graph(); we don't store the callable in the cache entry so
```
Review this with whitespace changes removed. I should have done this refactor before doing the prototype, but I didn't realize I needed to hit this.
```python
torch._inductor.config.aoti_wrapper = True
```
This is ABSOLUTELY not what the final API should be; it's just enough jank to get the test script to run.
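For context, a minimal sketch of how this POC toggle might be exercised, assuming this PR's branch (where the `aoti_wrapper` flag is defined; stock builds reject unknown Inductor config settings):

```python
import torch
import torch._inductor.config as inductor_config

# POC-only toggle from this PR: make torch.compile's Inductor backend emit an
# AOTInductor package instead of the usual Python wrapper. Not a stable API.
inductor_config.aoti_wrapper = True

def fn(x):
    return torch.sin(x) + torch.cos(x)

compiled_fn = torch.compile(fn)
print(compiled_fn(torch.randn(8)))
```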
```python
key = get_hash(":".join(compiled_fn), "", "code")
basename, subdir, path = get_path(key, "pt2", "")
pathlib.Path(path).parent.mkdir(parents=True, exist_ok=True)
output_code_log.debug("Package written to %s", path)
```
TODO: Need to do some CAS refactoring to make this better; package_aoti isn't really cooperating either. This write needs to be atomic, but not enough guts are exposed to do that cleanly.
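For reference, the conventional way to make this kind of cache write atomic is to stage the bytes into a temporary file in the destination directory and `os.replace()` it into place. A minimal sketch with illustrative names (not the PR's actual helpers):

```python
import os
import tempfile

def atomic_write_bytes(data: bytes, dest_path: str) -> None:
    """Write data to dest_path atomically: stage into a temp file in the same
    directory, then os.replace() so readers never observe a partial file."""
    dest_dir = os.path.dirname(dest_path)
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, dest_path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```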
```
@@ -0,0 +1,23 @@
import torch
```
Do we support @nocommit? I assume this file shouldn't make it into the final PR
This is just a POC; it needs lots of cleaning to actually land, so don't worry about it.
```python
    lambda: {"filename": artifact_path},
    payload_fn=lambda: code,
)
# TODO: This shoujld increment all the time
```
nit: typo
```python
# TODO: This could be better if we're ever able to serialize compiled
# models to disk.
disk_compiled_graph.current_callable = None
disk_compiled_graph.clear_uncacheable()
```
In #141502 I make a similar change (I called it prepare_for_serialization, but it's the same thing).
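A minimal sketch of the shared pattern, with illustrative class and field names (the real object is Inductor's compiled-graph cache entry): strip the process-local callable before pickling, and rebuild it when the entry is loaded from disk.

```python
import copy
import pickle

class CompiledGraphEntry:
    """Toy stand-in for a compiled-graph cache entry."""

    def __init__(self, source_code: str, current_callable):
        self.source_code = source_code
        self.current_callable = current_callable  # process-local, not picklable

    def prepare_for_serialization(self):
        """Drop fields that cannot be pickled before the entry is written to
        the on-disk cache; they are reconstructed on load."""
        self.current_callable = None

def save_entry(entry: CompiledGraphEntry, path: str) -> None:
    disk_entry = copy.copy(entry)           # keep the in-memory entry usable
    disk_entry.prepare_for_serialization()  # strip the unpicklable callable
    with open(path, "wb") as f:
        pickle.dump(disk_entry, f)
```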
Signed-off-by: Edward Z. Yang <[email protected]>
ghstack-source-id: c10f5b0
Pull Request resolved: #141700
… (pytorch#145381)

Summary:
Pull Request resolved: pytorch#145381

In this diff we are trying to introduce a stateful API to enable "fullgraph_package" mode, which will force Inductor to use AOTI as a backend. Different from PR pytorch#141700, we didn't try to populate the package file into the caching system; instead we bypass caching to simplify the implementation in its current form.

Similar to PR pytorch#141700, I did a quick benchmark of the loading time, and it looks like the following:

- Precompile
```
buck run mode/opt scripts/zhxchen17:precompile
```
- Load using cache:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader cache
```
Output:
```
real 0m24.593s
user 0m59.342s
sys 0m17.201s
```
- Load using load_fullgraph_package:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader precompile
```
Output:
```
real 0m10.907s
user 0m9.210s
sys 0m1.173s
```

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_fullgraph_package_basic_function

Differential Revision: D68459341
Summary:
Design doc: https://docs.google.com/document/d/1Z15cBBPjoZ7gH00TSgCdgaYko7a7Br-ERd3_hA-g2IU/edit?usp=sharing

In this diff we are trying to introduce a stateful API to enable a global mode which will force Inductor to use AOTI as a backend. Different from PR #141700, we didn't try to populate the package file into the caching system; instead we bypass caching to simplify the implementation in its current form.

Similar to PR #141700, I did a quick benchmark of the loading time, and it looks like the following:

- Precompile
```
buck run mode/opt scripts/zhxchen17:precompile
```
- Load using cache:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader cache
```
Output:
```
real 0m24.593s
user 0m59.342s
sys 0m17.201s
```
- Load using load_fullgraph_package:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader precompile
```
Output:
```
real 0m10.907s
user 0m9.210s
sys 0m1.173s
```

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_fullgraph_package_basic_function

Differential Revision: D68459341
Summary:
Design doc: https://docs.google.com/document/d/1Z15cBBPjoZ7gH00TSgCdgaYko7a7Br-ERd3_hA-g2IU/edit?usp=sharing

In this diff we are trying to introduce a new API, a pre-torch.compile() object, which will force Inductor to use AOTI as a backend. This differs from PR #141700.

Similar to PR #141700, I did a quick benchmark of the loading time, and it looks like the following:

- Precompile
```
buck run mode/opt scripts/zhxchen17:precompile
```
- Load using cache:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader cache
```
Output:
```
real 0m24.593s
user 0m59.342s
sys 0m17.201s
```
- Load using load_fullgraph_package:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader precompile
```
Output:
```
real 0m10.907s
user 0m9.210s
sys 0m1.173s
```

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_fullgraph_package_basic_function

Differential Revision: D68459341
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
Stacked on #141691
Test transcript:
tlparse miss: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpUFFDtO/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
tlparse hit: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpv0D6hF/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
For comparison, here is the test script without AOTInductor as backend:
tlparse miss: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp7rrgy0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
tlparse hit: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpf5VKld/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
Although the AOTInductor code takes longer to compile, it is 2x as fast to load on cache hit. Furthermore, because its loading is very simple, it can be directly loaded for a precompile-style workflow.
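For readers who want to try the precompile-style flow this alludes to, here is a rough sketch using the standalone AOTInductor packaging entry points (exact signatures have shifted across releases, so treat this as approximate rather than as this PR's cache integration):

```python
import torch
import torch._inductor

class M(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x) + torch.cos(x)

example_inputs = (torch.randn(8),)

# One-time, slower step: export the model and package the AOTInductor output
# as a self-contained .pt2 archive on disk.
ep = torch.export.export(M(), example_inputs)
package_path = torch._inductor.aoti_compile_and_package(
    ep, package_path="/tmp/model.pt2"
)

# Later, fast step: load the packaged artifact directly (no recompilation)
# and call it like an ordinary function.
loaded = torch._inductor.aoti_load_package(package_path)
print(loaded(*example_inputs))
```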
Signed-off-by: Edward Z. Yang [email protected]
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov