
Conversation

@zhxchen17
Contributor

@zhxchen17 zhxchen17 commented Feb 20, 2025

Package API for torch.compile

Following up PR #145381, we implement a new API for compiling models using the cpp wrapper and for saving/loading the compiled artifacts to disk.

A package is designed to be a per-torch.compile() object that lives with the compilation context. Each time a recompilation happens, it collects the compiled artifacts into a lookup table. When a new set of inputs is passed to the compiled callable, we first perform a lookup in the compile package, before entering the dynamo cache, and match entries by their serialized guards.
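Roughly, the lookup could be pictured like the sketch below (the class and method names here are illustrative only, not the actual ones in this PR):

```
from typing import Any, Callable, Dict, Optional

class _CompilePackageSketch:
    """Illustrative sketch of a per-torch.compile() package: it collects one
    compiled artifact per recompilation and serves lookups, keyed by the
    serialized guards, before the dynamo cache is consulted."""

    def __init__(self) -> None:
        # serialized guards -> compiled artifact (e.g. dynamo code + AOTI kernel)
        self._entries: Dict[str, Callable[..., Any]] = {}

    def record(self, serialized_guards: str, artifact: Callable[..., Any]) -> None:
        # called whenever a (re)compilation produces a new artifact
        self._entries[serialized_guards] = artifact

    def lookup(self, serialized_guards: str) -> Optional[Callable[..., Any]]:
        # a hit here means we can run the stored artifact without re-entering dynamo
        return self._entries.get(serialized_guards)
```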

API names are tentative, but the workflow roughly looks like the following:

```
def f(...): ...

compiled_f = torch.compile(f, package="my_dir/my_model")

compiled_f(*args)

compiled_f.save_package(prefix="/dir1")

...

compiled_f.load_package(prefix="/dir2")
```

Fixes #ISSUE_NUMBER

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Feb 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147528

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 3 Unrelated Failures

As of commit 54bc977 with merge base ae29f05:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines 92 to 100
```
        name = next(
            n for n in self.dynamo_code.co_names if n.startswith("__compiled_fn")
        )
        return types.FunctionType(self.dynamo_code, globals={name: self.aoti})(
            *args, **kwargs
        )
```
Contributor

This seems wrong.

Dynamo code might depend on other globals in addition to __compiled_fn0, both in user code and in generated code. I think we need to examine the co_names of the code object.

This will also be incorrect if you compile two functions in the same file since the second will be __compiled_fn1.

Contributor Author

This will also be incorrect if you compile two functions in the same file since the second will be __compiled_fn1

I'm not sure I follow this. If we fullgraph compile, shouldn't there be only one compiled fn mapped to the dynamo code? If we mean globally here, then I expect each compiled object to reference only one compiled function starting with "__compiled_fn"; that's why we filter the co_names on line 93.

I think we need to examine the co_names of the code object.

What do you mean by "examine the co_names"? Should we raise an error if we see extra global names in co_names?

Contributor Author

I guess an example would be helpful, and I'll be happy to add it to the tests.

Contributor

You can fullgraph compile two different functions in the same file (or two different shapes for the same functions).

For other globals, put a tensor in global scope and read from it.
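For example, a minimal repro along those lines might look like this (hypothetical test code, not part of the PR):

```
import torch

SCALE = torch.ones(3)  # global tensor read from inside the compiled region

@torch.compile(fullgraph=True)
def f(x):
    # the dynamo-generated code for f also references the global name "SCALE"
    return x * SCALE

@torch.compile(fullgraph=True)
def g(x):
    # second function compiled in the same file; per the comment above, its
    # compiled graph would be named "__compiled_fn1" rather than "__compiled_fn0"
    return x + 1

f(torch.randn(3))
g(torch.randn(3))
```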

@oulgen
Contributor

oulgen commented Feb 22, 2025

@zhxchen17 @jansel I have a bit of a meta point: can we NOT call the public API a cache? The term is overloaded, and people tend to come back saying "the cache is not working", but there's always ambiguity: is it the dynamo recompilation cache? Is it the inductor/AOTAutograd cache? And now, is it the sticky cache?

We already have torch.compiler.load_cache_artifacts and torch.compiler.save_cache_artifacts, which makes this even more confusing.

I feel like there would be good value in clear disambiguation.

@zhxchen17
Contributor Author

zhxchen17 commented Feb 23, 2025

@zhxchen17 @jansel I have a bit of a meta point: can we NOT call the public API a cache? The term is overloaded, and people tend to come back saying "the cache is not working", but there's always ambiguity: is it the dynamo recompilation cache? Is it the inductor/AOTAutograd cache? And now, is it the sticky cache?

We already have torch.compiler.load_cache_artifacts and torch.compiler.save_cache_artifacts, which makes this even more confusing.

I feel like there would be good value in clear disambiguation.

@oulgen I got your point. Do you like the name "persistent_artifacts" better than sticky_cache? I can switch the name to avoid the confusion here.

@oulgen
Contributor

oulgen commented Feb 27, 2025

@zhxchen17 @jansel I have a bit of a meta point: can we NOT call the public API a cache? The term is overloaded, and people tend to come back saying "the cache is not working", but there's always ambiguity: is it the dynamo recompilation cache? Is it the inductor/AOTAutograd cache? And now, is it the sticky cache?
We already have torch.compiler.load_cache_artifacts and torch.compiler.save_cache_artifacts, which makes this even more confusing.
I feel like there would be good value in clear disambiguation.

@oulgen I got your point. Do you like the name "persistent_artifacts" better than sticky_cache? I can switch the name to avoid the confusion here.

yes, that sounds a lot better, thanks

@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch from bed234e to 3418d0d Compare February 28, 2025 15:54
@zhxchen17 zhxchen17 marked this pull request as draft February 28, 2025 15:55
@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch from 3418d0d to 44ad78e Compare February 28, 2025 16:25

```
if aot_config.sticky_cache is not None:
    if any(
        info.mutation_type != MutationType.NOT_MUTATED
```
Contributor

Let me know if you want more review on the AOTAutograd bits (we might want to think more about how this interacts with the existing AOT warm cache today).

One comment is that input mutations are probably fine for sticky cache, as long as the input mutation is captured inside of the graph (the "bad" case is when the input mutation is forced to run outside of the graph, in an opaque AOTAutograd epilogue). Almost all input mutations are captured in-graph today.
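For example (hypothetical user code), the "fine" case described above is an input mutation that stays inside the compiled graph:

```
import torch

@torch.compile(fullgraph=True)
def step(buf: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    buf.add_(x)       # in-place mutation of an input, captured in-graph
    return buf.sum()

buf = torch.zeros(4)
step(buf, torch.ones(4))
```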

@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch 5 times, most recently from 697d678 to fc4aff3 Compare March 11, 2025 04:40
@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch 2 times, most recently from c73f996 to 542455a Compare March 27, 2025 02:30
@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch 2 times, most recently from 546f276 to 536abe3 Compare March 27, 2025 04:07
@zhxchen17 zhxchen17 changed the title Sticky cache API for torch.compile Package API for torch.compile Mar 27, 2025
@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch 3 times, most recently from 84ba2e9 to eb8b103 Compare March 27, 2025 14:48
@oulgen
Contributor

oulgen commented Mar 27, 2025

@zhxchen17 I think it might be good to discuss the @torch.compile(package="/package/path") API. I suspect having the package name on the compile API is OK for most OSS jobs, but it will make it difficult to apply on mast jobs, so it might be worthwhile moving this to the load/save_package calls.

Contributor

@jamesjwu jamesjwu left a comment

Really cool! Some initial questions about code organization, at least for the higher-level sections I understand, haha.


```
class _GraphCompile:
    """
    Stores the compiled artifacts per compiled FX graph. This includes:
```
Contributor

This is not exactly accurate, is it? It stores the compiled artifacts for N compiled FXGraphs, not just one, unless you mean the input function, which itself is not an FXGraph?


```
    def __init__(self, name: str) -> None:
        self.name = name
        self.forward_aoti: Optional[CompiledAOTI] = None
```
Contributor

In my head, what would make the most sense is that an AOTAutogradCacheEntry contains a forward/backward_aoti. I know this is a prototype so you probably didn't want to change AOTAutogradCacheEntry, but I think what you'd actually want is:

  • Refactor AOTAutogradCacheEntry into an interface with a forward/backward OutputCode + the aot autograd wrappers. You could make it generic on the OutputCode type, i.e. AOTAutogradCacheEntry[CompiledAOTI] vs. AOTAutogradCacheEntry[CompiledFXGraph], etc.

  • Split AOTAutogradCacheEntry into the cache path, where there's a python wrapper + FXGraphCache keys, vs. the package API, which uses AOTI artifacts.

Then here, you would just store self.aot_autograd_output : AOTAutogradCacheEntry[CompiledAOTI] (rough sketch below).
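Something along these lines, as a very rough sketch (the class name and fields below are illustrative assumptions, not the real AOTAutogradCacheEntry definition):

```
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

OutputCodeT = TypeVar("OutputCodeT")  # e.g. CompiledAOTI or a python-wrapper FX graph

@dataclass
class AOTAutogradEntrySketch(Generic[OutputCodeT]):
    # forward/backward output code, generic over the wrapper representation
    forward: OutputCodeT
    backward: Optional[OutputCodeT]  # None for inference-only graphs
    # ... plus the aot autograd wrappers / runtime metadata in the real entry

# The package path would then hold something like
#   self.aot_autograd_output: AOTAutogradEntrySketch[CompiledAOTI]
# while the cache path keeps using a python-wrapper-backed entry.
```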

```
    aot_config = pickle.load(f)

    def _(*args, **kwargs):
        raise RuntimeError("NYI")
```
Contributor

You're never gonna be able to serialize these functions, so they're not NYI, right? They're actually full forward/backward compilers (often inductor libraries). Instead I think you want to ignore these callables (or assume that they are always the same).
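As a rough sketch of that suggestion (the attribute names below are assumptions for illustration, not the exact AOTConfig fields):

```
import pickle
from types import SimpleNamespace

def strip_compilers(config: object) -> dict:
    # Drop the compiler callables before pickling: they are real forward/backward
    # compilers, so on load we would reattach the standard ones instead of
    # trying to deserialize them.
    state = dict(vars(config))
    for key in ("fw_compiler", "bw_compiler", "inference_compiler"):
        state.pop(key, None)
    return state

# Toy stand-in for an AOTConfig-like object, for illustration only.
cfg = SimpleNamespace(num_params_buffers=0, fw_compiler=lambda gm, ex: gm, bw_compiler=None)
payload = pickle.dumps(strip_compilers(cfg))
```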

```
# second half, and serializer should read data from the return
# value of the first half.
# For prototyping we just leave this as a separate class.
class GuardSerializer:
```
Contributor

What do we think about moving the guard serde into its own PR? That way you could test it separately and have it be generic, rather than specific to the package API.

```
            value_len, get_verbose_code_parts(code, guard)
        )

    def HASATTR(self, guard, metadata):
```
Contributor

What's the protocol for making sure people don't add new guards without adding this?

Wouldn't it be simpler to have each Guard carry its own ser/deser method? That way, when people add new Guard types, they would have to implement serde directly, instead of it living in a separate class.
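A minimal sketch of that alternative (hypothetical classes, not the actual dynamo Guard types):

```
from typing import Any, Dict

class SerializableGuard:
    """Each guard type implements its own serde, so adding a new guard type
    without serde support fails loudly at the definition site."""

    def serialize(self) -> Dict[str, Any]:
        raise NotImplementedError

    @classmethod
    def deserialize(cls, data: Dict[str, Any]) -> "SerializableGuard":
        raise NotImplementedError

class HasAttrGuardSketch(SerializableGuard):
    def __init__(self, attr_source: str) -> None:
        self.attr_source = attr_source

    def serialize(self) -> Dict[str, Any]:
        return {"kind": "HASATTR", "attr_source": self.attr_source}

    @classmethod
    def deserialize(cls, data: Dict[str, Any]) -> "HasAttrGuardSketch":
        return cls(data["attr_source"])
```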

```
load_paths = glob.glob(os.path.join(load_dir, "*.dynamo_code"))
self.assertEqual(len(load_paths), 0)

compiled_fn.save_package()
```
Contributor

Why not take the path here? Seems a bit weird to have the path be an arg to torch.compile (which won't do anything unless you call save).


```
precompiles = []
for i in range(len(load_paths)):
    precompile = torch._dynamo.compile_package._load_precompile(load_dir, i)
```
Contributor

How is this different from compiled_fn.load_package?

```
with self.assertRaisesRegex(
    RuntimeError, "Compile package is only supported .*fullgraph=True"
):
    torch.compile(f, dynamic=False, package=self.path())
```
Contributor

I'm a bit confused about how package= changes the semantics of torch.compile. Ideally there would be no semantic differences (in which case the path could just be an arg to save/load_package).

Contributor Author

@zhxchen17 zhxchen17 Mar 28, 2025

As discussed yesterday, there will always be corner cases where compile package doesn't work (e.g. unpicklable local classes), so we need something that makes torch.compile() start building the package only when asked to.

That being said, we can also provide a global boolean flag instead. (Might be easier for batch testing anyway.)

```
    return a

with self.assertRaisesRegex(
    RuntimeError, "Compile package is only supported .*fullgraph=True"
```
Contributor

I want to confirm you plan to remove the limitation, since a lot of key users of this have graph breaks.

Contributor Author

@jansel yeah, I intend to remove this limitation very soon


```
with self.assertRaisesRegex(
    NotImplementedError,
    "dynamic shape",
```
Contributor

This is planned, correct?

Contributor Author

yep

```
f = torch.compile(f, fullgraph=True, dynamic=False, package=self.path())
with self.assertRaisesRegex(
    NotImplementedError,
    "backward",
```
Contributor

This is planned, correct?

Contributor Author

yep, planned

```
from torch._dynamo.exc import InternalTorchDynamoError

with self.assertRaisesRegex(
    InternalTorchDynamoError,
```
Contributor

This seems like the wrong exception type.

```
self._test_load(f, args3, expected3, self.path())

# self._test_save(f, (torch.randn(4, 2), torch.randn(4, 3)), None, self.path(), None)
```

Contributor

Can we add a test for trying to load a saved package generated for function foo onto a different function bar?

  1. If the name/signature don't match, we should throw an error on load.
  2. We should also recursively hash the source code and assert the source codes match, with a flag to disable this check (sketched below).
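For the source check in item 2, a rough sketch of what such a helper could look like (hypothetical, not an existing torch API):

```
import hashlib
import inspect

def source_hash(fn) -> str:
    # Hash the function's source so a loaded package can assert it was built
    # for the same code; the real check would also recurse into callees, and
    # a load-time flag could disable it entirely.
    return hashlib.sha256(inspect.getsource(fn).encode("utf-8")).hexdigest()
```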

Comment on lines +2302 to +2303
self.apply_options({"cpp_wrapper": True})
self.apply_options({"aot_inductor.package": True})
Contributor

@jansel jansel Mar 27, 2025

IMO the initial version of this should use the Python wrapper code. I think this will be required (at least in the short term) to support:

  1. Training
  2. Dynamic shapes
  3. Graph breaks

I think the use case of single-graph, static-shape, inference is already well covered by export+AOTI and doesn't really need a new API.

We should be able to reuse our existing caching infra for this.

Contributor

I think if we refactor the package code with AOTAutogradCacheEntry properly, we can combine these designs nicely into a single framework:

  • Caching creates an AOTAutogradCacheEntry with Python wrapper cache entries
  • Precompile creates an AOTAutogradCacheEntry with either CPP wrapper or python wrapper cache entries, + a set of serialized dynamo guards.

The nice thing about this is, if we do it right, we also get regular caching support for cpp wrapper, which we actually don't have today.

Contributor Author

@jansel I'm all for training as well. I just disagree with this:

inference is already well covered by export+AOTI and doesn't really need a new API.

Inference is not well covered by export+AOTI, since the UX is still broken for compiling partial graphs (I wrote a post about this: https://fb.workplace.com/groups/257735836456307/permalink/802058152024070/).

I'd like to still take inference into account, and as @jamesjwu mentioned, we can leverage caching to support both cases.

Contributor

I was talking about "single-graph, static-shape, inference" (all three at the same time) being well covered.

@zhxchen17 zhxchen17 force-pushed the zhxchen17/sticky_cache/0 branch from eb8b103 to 54bc977 Compare March 28, 2025 18:49
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 27, 2025
@ezyang
Contributor

ezyang commented Jun 6, 2025

Is this still alive?

@zhxchen17
Contributor Author

Is this still alive?

@ezyang Moving to #155118.

Will close this for now.

@zhxchen17 zhxchen17 closed this Jun 9, 2025
@github-actions github-actions bot deleted the zhxchen17/sticky_cache/0 branch July 13, 2025 02:21
