Use AOTI as inductor backend with precompile mode. #145381
Conversation
🔗 Helpful links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145381

Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 new failures as of commit fdbbd81 with merge base f951d21.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D68459341
Force-pushed from c99eac4 to 4b5e128.
Force-pushed from 4b5e128 to 4a4cada.
Use AOTI as inductor backend with precompile mode. (pytorch#145381)

Summary:
Pull Request resolved: pytorch#145381

In this diff we are trying to introduce some stateful API to enable "fullgraph_package" mode, which will force inductor to use AOTI as a backend. Different from PR pytorch#141700, we didn't try to populate the package file into the caching system; instead we bypass caching to simplify the implementation in its current form.

Similar to PR pytorch#141700, I did a quick benchmark of the loading time, which looks like the following:

- Precompile:
```
buck run mode/opt scripts/zhxchen17:precompile
```
- Load using cache:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader cache
```
Output:
```
real 0m24.593s
user 0m59.342s
sys 0m17.201s
```
- Load using load_fullgraph_package:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader precompile
```
Output:
```
real 0m10.907s
user 0m9.210s
sys 0m1.173s
```

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_fullgraph_package_basic_function

Differential Revision: D68459341
Force-pushed from 4a4cada to e282a02.
Force-pushed from e282a02 to cca48a8.
```
def setUp(self):
    if not os.path.exists(os.path.expandvars("/tmp/torchinductor_$USER/")):
        os.makedirs(os.path.expandvars("/tmp/torchinductor_$USER/"))

def tearDown(self):
    super().tearDown()
    pathlib.Path(self.path()).unlink(missing_ok=True)

def path(self):
    return os.path.expandvars(f"/tmp/torchinductor_$USER/model_{self.id()}.pt2")
```
Use the existing helper for the cache dir, pytorch/torch/_inductor/runtime/cache_dir_utils.py (lines 10 to 15 in f951d21):
```
def cache_dir() -> str:
    cache_dir = os.environ.get("TORCHINDUCTOR_CACHE_DIR")
    if cache_dir is None:
        os.environ["TORCHINDUCTOR_CACHE_DIR"] = cache_dir = default_cache_dir()
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
There is also an inductor-specific TestCase base class that sets the cache dir to a temporary place with automatic cleanup.
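For illustration, a minimal sketch of what the suggested change could look like in the test, assuming cache_dir() is imported from torch._inductor.runtime.cache_dir_utils (the method shown mirrors the existing path() helper):

```
# Sketch only: build the artifact path from the inductor cache-dir helper
# instead of a hard-coded /tmp/torchinductor_$USER/ directory.
import os

from torch._inductor.runtime.cache_dir_utils import cache_dir


def path(self):
    # cache_dir() honors TORCHINDUCTOR_CACHE_DIR and creates the directory if needed.
    return os.path.join(cache_dir(), f"model_{self.id()}.pt2")
```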
```
    return os.path.expandvars(f"/tmp/torchinductor_$USER/model_{self.id()}.pt2")

@unittest.skipIf(not TEST_CUDA, "requires cuda")
def test_fullgraph_package_basic_function(self):
```
Add some additional test cases. Test on CPU. Test training. Test errors (like wrong shapes passed). Etc.
```
    mode=mode,
    options=options,
    disable=disable,
    name=name,
```
Why is the name needed? It seems a bit clunky to specify both a path and a name.
```
def __init__(
    self,
    *,
    path: Optional[str] = None,
```
Should this be a required arg? If the user doesn't specify it, the semantics of a default path seem odd. Similar APIs like model.save() don't have a default path.
```
 if backend == "inductor":
-    backend = _TorchCompileInductorWrapper(mode, options, dynamic)
+    backend = _TorchCompileInductorWrapper(mode, options, dynamic, fullgraph, name)
```
Why does inductor need to know about fullgraph mode?
```
)


_PRECOMPILES: Dict[str, List[Any]] = {}
```
Not thread safe. I'd think we should be able to eliminate the name and store this on the torch.compile object (just need to thread a pointer to that object down into this function).
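For illustration, a minimal sketch of one way to address the thread-safety point (the lock and the _register_precompile helper are hypothetical; the per-object storage suggested above would remove the global dict and the name key entirely):

```
# Illustrative sketch only; names here are hypothetical, not part of the PR.
import threading
from typing import Any, Dict, List

_PRECOMPILES: Dict[str, List[Any]] = {}
_PRECOMPILES_LOCK = threading.Lock()


def _register_precompile(name: str, entry: Any) -> None:
    # Serialize concurrent writers to the global registry. Hanging the list
    # off the object returned by torch.compile would avoid the global state.
    with _PRECOMPILES_LOCK:
        _PRECOMPILES.setdefault(name, []).append(entry)
```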
```
if precompile := _get_precompile(graph_kwargs.get("name"), example_inputs):
    return precompile
```
This seems like an odd place to check the cache. By this point we have already run dynamo and AOT Autograd, which incur a lot of compile time -- but then we just throw out the graph we worked so hard to generate. If we moved this check up to the object returned by torch.compile then we could get the compile time down close to zero.
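A conceptual sketch of where the lookup could live instead (all names are hypothetical; the idea matches the follow-up description further down, which checks the package before entering the dynamo cache):

```
# Conceptual sketch only; CompiledWrapper and its helpers are hypothetical.
# The point is to consult the precompiled artifact before dynamo and AOT
# Autograd run at all, so a hit costs close to zero compile time.
class CompiledWrapper:
    def __call__(self, *args):
        entry = self._lookup_precompile(args)  # match guards on inputs only
        if entry is not None:
            return entry(*args)                # reuse the precompiled callable
        return self._compile_and_call(*args)   # fall back to full compilation
```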
```
        current_callable, 1, graph.device_type
    ).run  # type: ignore[attr-defined]
)
elif graph.device_type == "cpu":
```
Not tested?
```
        current_callable, 1
    ).run  # type: ignore[attr-defined]
)
elif graph.device_type == "xpu":
```
Not tested?
```
else:
    current_callable = compiled_fn


if graph.device_type.startswith("cuda"):
```
What about a graph with both CPU and CUDA?
```
# explicitly package precompiled artifacts into a single file.
# TODO Eventually we should come up with a context manager style API. To
# reduce the complexity of landing changes, we first introduce a set of
# stateful interfaces as the future building blocks to begin with.
```
This is intended to become a real API eventually, right? Do you think it's too early to put our best foot forward on the API (which would include not having underscores)?
Summary:
Design doc: https://docs.google.com/document/d/1Z15cBBPjoZ7gH00TSgCdgaYko7a7Br-ERd3_hA-g2IU/edit?usp=sharing

In this diff we are trying to introduce a new API per torch.compile() object which will force inductor to use AOTI as a backend. Different from PR #141700, we didn't try to populate the package file into the caching system; instead we bypass caching to simplify the implementation in its current form.

Similar to PR #141700, I did a quick benchmark of the loading time, which looks like the following:

- Precompile:
```
buck run mode/opt scripts/zhxchen17:precompile
```
- Load using cache:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader cache
```
Output:
```
real 0m24.593s
user 0m59.342s
sys 0m17.201s
```
- Load using load_fullgraph_package:
```
time buck run mode/opt scripts/zhxchen17:precompile -- --loader precompile
```
Output:
```
real 0m10.907s
user 0m9.210s
sys 0m1.173s
```

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_fullgraph_package_basic_function

Differential Revision: D68459341
Following up PR #145381, we implement a new API for compiling fullgraph models using the cpp wrapper, and save/load compiled artifacts to disk. The sticky cache is now designed to be a per-torch.compile() object living with the compilation context. Each time a recompilation happens, it collects the compiled artifacts into a lookup table. When a new set of inputs is passed to the compiled callable, before we enter the dynamo cache, we perform a lookup in the sticky cache first and match by the guards on inputs only. API names are tentative, but the workflow roughly looks like the following:

```
def f(...):
    ...

compiled_f = torch.compile(f, fullgraph=True, sticky_cache="my_dir/my_model")
compiled_f(*args)
compiled_f.save_sticky_cache(prefix="/dir1")
...
compiled_f.load_sticky_cache(prefix="/dir2")
```

Since this touches many layers of the torch.compile system, we start from the simple case of a forward-only graph, static shapes, and flat tensor inputs/outputs. Once the overall API converges, we can gradually remove the sticky_cache.unimplemented() calls from the code.
Continue in #147528.
Following up PR #145381, we implement a new API for compiling models using the cpp wrapper, and save/load compiled artifacts to disk. The package is now designed to be a per-torch.compile() object living with the compilation context. Each time a recompilation happens, it collects the compiled artifacts into a lookup table. When a new set of inputs is passed to the compiled callable, before we enter the dynamo cache, we perform a lookup in the compile package first and match by the serialized guards. API names are tentative, but the workflow roughly looks like the following:

```
def f(...):
    ...

compiled_f = torch.compile(f, package="my_dir/my_model")
compiled_f(*args)
compiled_f.save_package(prefix="/dir1")
...
compiled_f.load_package(prefix="/dir2")
```
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov