[REFACTOR] Inline FxGraphCache.post_compile into sole call site #141877
Conversation
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/141877
Note: links to docs will display an error until the docs builds have been completed.
As of commit 6fb06f0 with merge base 7c1d5db: 1 new failure, 1 unrelated failure.
NEW FAILURE - the following job has failed:
UNSTABLE - the following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    result._boxed_call = True
    # TODO: How come cudagraphs could be None here?
    # TODO: How come gm is None here?
    result.post_compile(example_inputs, fx_config["cudagraphs"], None)  # type: ignore[arg-type]
gm is only used for @masnesral's specific freezing use case. In that case, we need the post compile gm to do things. But we don't have that for AOTAutogradCache, so AOTAutogradCache doesn't support the freezing workflow. It always passes in None here since the gm should never be needed for post compile.
(The only "gm" it has access to at this step is the post dynamo gm. So anything needed to load from FXGraphCache needs to be obtainable from that)
Hmm, are we talking about the same code? I see here in post_compile:
if cudagraphs:
# It's possible that cudagraphs is enabled, but was disabled
# during a previous compilation we're loading from the cache.
# If so, we need to disable it on this new process too.
if self.disabled_cudagraphs_reason:
if "cuda" in self.device_types:
log_cudagraph_skip_and_bump_counter(
f"skipping cudagraphs due to {self.disabled_cudagraphs_reason}"
)
else:
counters["inductor"]["cudagraph_skips"] += 1
BoxedBool.disable(cudagraphs)
else:
cudagraph_post_compile(
example_inputs,
self,
cudagraphs,
gm,
)
So gm is needed to do some CUDA graphs stuff. I don't see how AOTAutograd cache proves that cudagraphs could never be enabled here. I can try to write a test case that tickles it... later, maybe.
Ah yeah, it's confusing because cudagraphs_post_compile only uses gm if freezing was enabled on the cache write and CompiledFxGraph doesn't already have the constants on hand (which it always will in AOTAutogradCache scenarios).
IMO we should change this so that the gm is not passed to cudagraphs_post_compile at all, only the constants themselves; whichever caller invokes cudagraphs_post_compile should handle obtaining the constants from the gm.
The gm field is really confusing to work with because it is essentially untyped: it can represent both a dynamo gm and a post aotdispatch gm.
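A minimal sketch of that proposal (all names and signatures hypothetical, not the actual Inductor API): the caller resolves the constants from whichever gm it holds, and the post-compile step only ever sees a plain constants dict, never an untyped gm:

```python
from typing import Any, Dict, Optional


def resolve_constants(
    cached_constants: Optional[Dict[str, Any]],
    gm: Optional[Any],
) -> Dict[str, Any]:
    # Hypothetical helper: prefer the constants saved with the cache entry
    # (the AOTAutogradCache case); fall back to re-reading them from the gm
    # only in the freezing path, where the cache entry lacks them.
    if cached_constants is not None:
        return cached_constants
    if gm is None:
        raise RuntimeError("freezing path requires a gm to recover constants")
    return dict(gm.named_buffers())  # placeholder for the real extraction


def cudagraph_post_compile(example_inputs, graph, cudagraphs, constants):
    # Sketch: the post-compile step now depends only on the constants
    # themselves, so it no longer needs to know which kind of gm the
    # caller happened to have.
    ...
```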
@masnesral It's really confusing what's happening here, so I might refactor it some more: we shouldn't call CompiledFxGraph.get_constants() with an optional gm and change its behavior depending on some unspoken invariants. We should only call this helper function if we know freezing is on in the cache entry and we're not coming from AOTAutogradCache.
The way it's written now, there's no way to tell that this gm is always None on the exact code path where that's precisely correct, and a correct value otherwise.
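One way to make that invariant checkable (names hypothetical, sketched as a replacement for the optional-gm pattern): branch explicitly on whether the entry was written with freezing, instead of letting the None-ness of gm silently change get_constants' behavior:

```python
from typing import Any, Dict, Optional


def constants_for_cache_entry(
    entry_constants: Optional[Dict[str, Any]],
    frozen: bool,
    gm: Optional[Any],
) -> Dict[str, Any]:
    # Hypothetical replacement for CompiledFxGraph.get_constants(gm):
    # the caller states its intent up front rather than encoding it in
    # whether gm happens to be None.
    if not frozen:
        # Non-freezing entries (including everything AOTAutogradCache
        # produces) always carry their constants; gm must go unused.
        assert entry_constants is not None
        return entry_constants
    # Freezing path: constants were stripped on cache write and must be
    # recovered from the gm, so a missing gm is a hard error, not a quirk.
    if gm is None:
        raise RuntimeError("frozen cache entry requires a gm to recover constants")
    return dict(gm.named_buffers())
```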
    FxGraphCache.post_compile(self, example_inputs, cudagraphs, gm)
This runs whether or not we have a cache hit, and always runs directly after we get a CompiledFxGraph.
I see, this refactor is nice because OutputGraph is at a layer higher than a cache, so it can do the post compilation unconditionally. Without OutputCode, this refactor would be hard to write because there would be two different ways you could obtain all the needed inputs: through compile or through cache hit.
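The layering point can be sketched like this (types and names hypothetical, a toy stand-in rather than the real Inductor classes): because both the compile path and the cache-hit path produce the same wrapper type, the caller can run post-compilation once, unconditionally:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OutputCode:
    # Toy stand-in for Inductor's OutputCode: a uniform wrapper around a
    # compiled artifact, whether freshly compiled or loaded from cache.
    call: Callable[[List[int]], List[int]]
    post_compiled: bool = False

    def post_compile(self, example_inputs, cudagraphs):
        # Runs exactly once per OutputCode, no matter where it came from.
        self.post_compiled = True


def get_output_code(cache_hit: bool) -> OutputCode:
    # Both branches return the same type, so the caller below needs no
    # cache-hit-vs-compile special casing.
    return OutputCode(call=lambda xs: [x + 1 for x in xs])


def compile_fx(example_inputs, cache_hit: bool) -> OutputCode:
    code = get_output_code(cache_hit)
    code.post_compile(example_inputs, cudagraphs=None)  # unconditional
    return code
```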
That said, maybe I'm missing context from earlier in the stack, but I don't remember how the deserialize step converts the stuff saved into the cache into an OutputCode. I assume we still need something like that, to massage the cached result into the object we return from Inductor.
Actually, you could have always done this refactor, I just forgot to update this call site when I initially introduced post_compile. It's not doing much.
That said, maybe I'm missing context from earlier in the stack, but I don't remember how the deserialize step converts the stuff saved into the cache into an OutputCode. I assume we still need something like that, to massage the cached result into the object we return from Inductor.
I'm not exactly sure what your question here is. But CompiledFxGraph is directly unpickled from the cache:
def iterate_over_candidates() -> Generator[CompiledFxGraph, None, None]:
if local:
subdir = FxGraphCache._get_tmp_dir_for_key(key)
if os.path.exists(subdir):
for path in sorted(os.listdir(subdir)):
try:
with open(os.path.join(subdir, path), "rb") as f:
yield pickle.load(f)
except Exception:
log.warning(
"fx graph cache unable to load compiled graph",
exc_info=True,
)
And, contrary to expectation, the current_callable is directly bong'ed in by codecache.py
try:
with dynamo_timed(
"PyCodeCache.load_by_key_path", log_pt2_compile_event=True
):
graph.current_callable = PyCodeCache.load_by_key_path(
graph.cache_key,
artifact_path,
graph.cache_linemap,
graph.get_constants(gm),
).call
This part I do plan to refactor as well but I haven't gotten to it yet.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…se class (#141878) Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: #141878 Approved by: https://github.com/jamesjwu ghstack dependencies: #141877
Double checked that the tests that were failing are now passing, along with TestAOTAutogradWithCache which was failing in the beginning. Going to attempt one more time to land this.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 0 checks. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: torchbench / cuda12.1-py3.10-gcc9-sm80 / test (torchbench_gcp_smoketest, 1, 1, awsa100.linux.gcp.a100). Details for Dev Infra team: raised by workflow job.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased
Don't think the last error is related, so will try this one more time. @pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 2 checks: torchbench / cuda12.1-py3.10-gcc9-sm80 / test (torchbench_gcp_smoketest, 1, 1, awsa100.linux.gcp.a100), inductor / cuda12.4-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…rch#141877) I am going to break apart the arguments passed to the constituents to only pass exactly what is needed, so easy access to the insides is helpful here. This also moves two helper functions to output_code.py as well. Also set _boxed_call at constructor. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#141877 Approved by: https://github.com/jamesjwu, https://github.com/jansel
…se class (pytorch#141878) Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#141878 Approved by: https://github.com/jamesjwu ghstack dependencies: pytorch#141877
…te (pytorch#141877)" This reverts commit 6153439. Reverted pytorch#141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures shows up after this lands ([comment](pytorch#141877 (comment)))
…rch#141877) I am going to break apart the arguments passed to the constituents to only pass exactly what is needed, so easy access to the insides is helpful here. This also moves two helper functions to output_code.py as well. Also set _boxed_call at constructor. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#141877 Approved by: https://github.com/jamesjwu, https://github.com/jansel Co-authored-by: James Wu <[email protected]>
…te (pytorch#141877)" This reverts commit 3ab4a28. Reverted pytorch#141877 on behalf of https://github.com/huydhn due to Job are failing en masse after this lands, so it looks like a land race ([comment](pytorch#141877 (comment)))
…rch#141877) I am going to break apart the arguments passed to the constituents to only pass exactly what is needed, so easy access to the insides is helpful here. This also moves two helper functions to output_code.py as well. Also set _boxed_call at constructor. Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#141877 Approved by: https://github.com/jamesjwu, https://github.com/jansel Co-authored-by: James Wu <[email protected]>
I am going to break apart the arguments passed to the constituents to only pass exactly what is needed, so easy access to the insides is helpful here. This also moves two helper functions to output_code.py as well. Also set _boxed_call at constructor. Signed-off-by: Edward Z. Yang <[email protected]> ghstack-source-id: 91491ea Pull Request resolved: pytorch/pytorch#141877
I am going to break apart the arguments passed to the constituents to only pass exactly what is needed, so easy access to the insides is helpful here. This also moves two helper functions to output_code.py as well. Also set _boxed_call at constructor. Signed-off-by: Edward Z. Yang <[email protected]> ghstack-source-id: a59f210 Pull Request resolved: pytorch/pytorch#141877
I am going to break apart the arguments passed to the constituents to only pass exactly what is needed, so easy access to the insides is helpful here. This also moves two helper functions to output_code.py as well. Also set _boxed_call at constructor. Signed-off-by: Edward Z. Yang <[email protected]> ghstack-source-id: de15fd6 Pull Request resolved: pytorch/pytorch#141877

Stack from ghstack (oldest at bottom):
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.
This also moves two helper functions to output_code.py as well.
Also set _boxed_call at constructor.
Signed-off-by: Edward Z. Yang <[email protected]>
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov