Conversation

@ezyang (Contributor) commented Dec 2, 2024

Stack from ghstack (oldest at bottom):

I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.
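
For context, the boxed calling convention means the compiled callable receives all of its inputs as a single list, so the runtime can clear input references early. A minimal illustrative sketch of the idea (not the actual inductor code):

```python
from typing import Any, Callable, List


def make_boxed(fn: Callable[..., Any]) -> Callable[[List[Any]], Any]:
    """Wrap an unboxed callable so it takes a single list of args."""

    def boxed(args: List[Any]) -> Any:
        out = fn(*args)
        args.clear()  # drop input references as early as possible
        return out

    boxed._boxed_call = True  # marker attribute that callers check before dispatching
    return boxed
```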

Signed-off-by: Edward Z. Yang [email protected]

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

[ghstack-poisoned]
ezyang requested a review from bdhirsh as a code owner December 2, 2024 15:33
@pytorch-bot bot commented Dec 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141877

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 6fb06f0 with merge base 7c1d5db:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

result._boxed_call = True
# TODO: How come cudagraphs could be None here?
# TODO: How come gm is None here?
result.post_compile(example_inputs, fx_config["cudagraphs"], None) # type: ignore[arg-type]
Contributor

gm is only used for @masnesral's specific freezing use case; in that case, we need the post-compile gm to do things. But we don't have that for AOTAutogradCache, so AOTAutogradCache doesn't support the freezing workflow. It always passes in None here, since the gm should never be needed for post compile.

(The only "gm" it has access to at this step is the post-dynamo gm, so anything needed to load from FxGraphCache has to be obtainable from that.)

Contributor Author

Hmm, are we talking about the same code? I see here in post_compile:

        if cudagraphs:
            # It's possible that cudagraphs is enabled, but was disabled
            # during a previous compilation we're loading from the cache.
            # If so, we need to disable it on this new process too.
            if self.disabled_cudagraphs_reason:
                if "cuda" in self.device_types:
                    log_cudagraph_skip_and_bump_counter(
                        f"skipping cudagraphs due to {self.disabled_cudagraphs_reason}"
                    )
                else:
                    counters["inductor"]["cudagraph_skips"] += 1
                BoxedBool.disable(cudagraphs)
            else:
                cudagraph_post_compile(
                    example_inputs,
                    self,
                    cudagraphs,
                    gm,
                )

So gm is needed to do some CUDA graphs stuff. I don't see how AOTAutograd cache proves that cudagraphs could never be enabled here. I can try to write a test case that tickles it... later, maybe.
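
A rough sketch of the kind of test that could tickle this path (hypothetical: the config flag names here are assumptions from memory, not a verified repro):

```python
import torch
import torch._functorch.config as functorch_config
import torch._inductor.config as inductor_config


# Assumed flags: freezing + cudagraphs on, AOTAutogradCache enabled.
@inductor_config.patch({"freezing": True, "triton.cudagraphs": True})
@functorch_config.patch({"enable_autograd_cache": True})
def run_repro() -> None:
    mod = torch.nn.Linear(8, 8).cuda()
    opt = torch.compile(mod, mode="reduce-overhead")
    x = torch.randn(2, 8, device="cuda")
    with torch.no_grad():
        # On a warm cache hit, post_compile would run with gm=None
        # while cudagraphs is still enabled.
        opt(x)
```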

Contributor

Ah yeah, it's confusing because cudagraph_post_compile only uses gm if freezing was enabled at cache-write time and CompiledFxGraph doesn't already have the constants on hand (which it always will in AOTAutogradCache scenarios).

IMO we should change this so the gm is not passed to cudagraph_post_compile, only the constants themselves, and instead have each caller of cudagraph_post_compile handle obtaining the constants from the gm.

The gm field is really confusing to work with because it is essentially untyped: it can represent both a dynamo gm and a post-aotdispatch gm.
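
To make that concrete, a minimal self-contained sketch of the proposed shape (all names here are hypothetical, not the actual inductor API):

```python
from typing import Any, Dict


def cudagraph_post_compile(constants: Dict[str, Any]) -> None:
    """Hypothetical narrowed hook: consumes constants, never a GraphModule."""
    for name in constants:
        print(f"recording constant {name!r} for cudagraph replay")


class FakeGraphModule:
    """Stand-in for torch.fx.GraphModule, for this sketch only."""

    def __init__(self, constants: Dict[str, Any]) -> None:
        self._constants = constants

    def named_constants(self) -> Dict[str, Any]:
        return dict(self._constants)


def post_compile_from_fx_cache(gm: FakeGraphModule) -> None:
    # The FxGraphCache caller owns the gm (freezing case), so it
    # resolves the constants itself before calling the hook.
    cudagraph_post_compile(gm.named_constants())


def post_compile_from_aot_cache(stored_constants: Dict[str, Any]) -> None:
    # The AOTAutogradCache caller already has the constants on the
    # cache entry, so no gm is needed anywhere on this path.
    cudagraph_post_compile(stored_constants)
```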

Contributor

@masnesral It's really confusing what's happening here, so I might refactor it some more: we shouldn't call CompiledFxGraph.get_constants() with an optional gm and change its behavior depending on some unspoken invariants. We should only call this helper function if we know freezing is on in the cache entry and we're not coming from AOTAutogradCache.

The way it's written now, there's no way to tell that this gm is always None on the exact code path where None is correct, and a real value otherwise.
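
A hedged sketch of what that explicit guard could look like (the entry attributes and flag are illustrative assumptions, not real fields):

```python
from typing import Any, Dict, Optional


def resolve_constants(
    entry: Any,  # stand-in for a CompiledFxGraph cache entry
    gm: Optional[Any],  # dynamo GraphModule, or None on the AOT cache path
    from_aot_autograd_cache: bool,
) -> Dict[str, Any]:
    # Consult the gm only when freezing was on at cache-write time and
    # we did not come from AOTAutogradCache; otherwise the constants
    # must already live on the entry and gm is expected to be None.
    if entry.freezing_enabled and not from_aot_autograd_cache:
        assert gm is not None, "freezing path requires the dynamo gm"
        return entry.get_constants(gm)
    return entry.constants
```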

Comment on lines -313 to +436
FxGraphCache.post_compile(self, example_inputs, cudagraphs, gm)
This runs whether or not we have a cache hit, and always runs directly after we get a CompiledFxGraph.
Contributor

I see, this refactor is nice because OutputGraph is at a layer higher than the cache, so it can do the post-compilation unconditionally. Without OutputCode, this refactor would be hard to write, because there would be two different ways to obtain all the needed inputs: through compile or through a cache hit.

That said, maybe I'm missing context from the last stack, but I don't remember how the deserialize step converts the stuff saved into the cache into an OutputCode. I assume we still need something like that, to massage the cached result into the object we return from inductor.

Contributor Author

Actually, you could have always done this refactor; I just forgot to update this call site when I initially introduced post_compile. It's not doing much.

> That said, maybe I'm missing context from the last stack, but I don't remember how the deserialize step converts the stuff saved into the cache into an OutputCode. I assume we still need something like that, to massage the cached result into the object we return from inductor.

I'm not exactly sure what your question here is. But CompiledFxGraph is directly unpickled from the cache:

        def iterate_over_candidates() -> Generator[CompiledFxGraph, None, None]:
            if local:
                subdir = FxGraphCache._get_tmp_dir_for_key(key)
                if os.path.exists(subdir):
                    for path in sorted(os.listdir(subdir)):
                        try:
                            with open(os.path.join(subdir, path), "rb") as f:
                                yield pickle.load(f)
                        except Exception:
                            log.warning(
                                "fx graph cache unable to load compiled graph",
                                exc_info=True,
                            )

And, contrary to expectation, the current_callable is directly bong'ed in by codecache.py

        try:
            with dynamo_timed(
                "PyCodeCache.load_by_key_path", log_pt2_compile_event=True
            ):
                graph.current_callable = PyCodeCache.load_by_key_path(
                    graph.cache_key,
                    artifact_path,
                    graph.cache_linemap,
                    graph.get_constants(gm),
                ).call

This part I do plan to refactor as well but I haven't gotten to it yet.
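
For what it's worth, one hedged sketch of how that might look (hypothetical shape, not what this PR does): let the cache entry hydrate its own callable instead of codecache.py assigning current_callable from outside.

```python
from typing import Any, Callable, Optional


class CompiledFxGraphSketch:
    """Toy stand-in for CompiledFxGraph, for this sketch only."""

    def __init__(self, cache_key: str, artifact_path: str) -> None:
        self.cache_key = cache_key
        self.artifact_path = artifact_path
        self.current_callable: Optional[Callable[..., Any]] = None

    def hydrate(self, loader: Callable[[str, str], Any]) -> None:
        # The entry loads its own compiled module and installs the
        # entry point, rather than having a caller reach in.
        self.current_callable = loader(self.cache_key, self.artifact_path).call
```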

ezyang added the ciflow/trunk and topic: not user facing labels Dec 2, 2024
@ezyang (Contributor, Author) commented Dec 2, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@jamesjwu (Contributor) commented Dec 3, 2024

Double-checked that the tests that were failing are now passing, along with TestAOTAutogradWithCache, which was failing in the beginning. Going to attempt one more time to land this.

python test/functorch/test_aotdispatch.py TestAOTAutogradWithCache -v

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 0 checks:


@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: torchbench / cuda12.1-py3.10-gcc9-sm80 / test (torchbench_gcp_smoketest, 1, 1, awsa100.linux.gcp.a100)


@jamesjwu (Contributor) commented Dec 3, 2024

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

@pytorchmergebot (Collaborator)

Successfully rebased gh/ezyang/3034/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/141877)

eellison removed their request for review December 3, 2024 23:06
@jamesjwu (Contributor) commented Dec 4, 2024

Don't think the last error is related, so will try this one more time.

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 2 checks: torchbench / cuda12.1-py3.10-gcc9-sm80 / test (torchbench_gcp_smoketest, 1, 1, awsa100.linux.gcp.a100), inductor / cuda12.4-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable)


pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

…rch#141877)

Pull Request resolved: pytorch#141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

…te (pytorch#141877)"

This reverts commit 6153439.

Reverted pytorch#141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures show up after this lands ([comment](pytorch#141877 (comment)))

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

…rch#141877)

Pull Request resolved: pytorch#141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
Co-authored-by: James Wu <[email protected]>

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

…te (pytorch#141877)"

This reverts commit 3ab4a28.

Reverted pytorch#141877 on behalf of https://github.com/huydhn due to Jobs are failing en masse after this lands, so it looks like a land race ([comment](pytorch#141877 (comment)))

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

…rch#141877)

Pull Request resolved: pytorch#141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
Co-authored-by: James Wu <[email protected]>

AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024

…rch#141877)

Pull Request resolved: pytorch#141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
Co-authored-by: James Wu <[email protected]>

Esquains pushed three commits to Esquains/study1 that referenced this pull request Dec 15, 2024 (ghstack-source-ids: 91491ea, a59f210, de15fd6)

Pull Request resolved: pytorch/pytorch#141877
@github-actions github-actions bot deleted the gh/ezyang/3034/head branch January 4, 2025 02:04

Labels: ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/torchbench, ciflow/trunk (Trigger trunk jobs on your pull request), keep-going (Don't stop on first failure, keep running tests until the end), Merged, module: inductor, Reverted, topic: not user facing (topic category)
