[inductor] Add a subprocess-based parallel compile #126816
Conversation
Summary: Adds a "safe" parallel-compile implementation that (a) Popens a sub-process with an entry point we control, and (b) uses a ProcessPoolExecutor in that sub-process to perform parallel compiles. This change essentially squashes two implementations from jansel, but removes the "thread-based" approach since benchmarking revealed that its compile-time performance was poor compared to the existing impl: #124682 #122941. This PR adds the implementation but defaults to the existing "fork"; I'll submit a separate change to enable it.
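The pattern the summary describes can be sketched in a few lines. This is a minimal illustration, not the actual torch._inductor code: the names `WORKER_SRC` and `run_parallel_compiles` are hypothetical, and `math.factorial` stands in for a real compile job. The parent Popens a Python subprocess whose entry point we control; that subprocess owns a ProcessPoolExecutor, runs the jobs in parallel, and returns results over a pipe.

```python
import json
import math
import subprocess
import sys

# Source for the subprocess we control (stand-in for the PR's SubprocMain):
# it owns a ProcessPoolExecutor and performs the parallel work.
WORKER_SRC = r"""
import json, math, sys
from concurrent.futures import ProcessPoolExecutor

jobs = json.loads(sys.stdin.readline())
with ProcessPoolExecutor(max_workers=2) as pool:
    # math.factorial stands in for an actual compile task
    results = list(pool.map(math.factorial, jobs))
sys.stdout.write(json.dumps(results))
sys.stdout.flush()
"""

def run_parallel_compiles(jobs):
    # Popen a sub-process with an entry point we control, send it work on
    # stdin, and read the results back from stdout.
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SRC],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate(json.dumps(jobs) + "\n")
    return json.loads(out)

print(run_parallel_compiles([1, 2, 3]))  # → [1, 2, 6]
```

The point of the extra Popen layer is isolation: the pool of workers lives entirely inside a process whose startup we control, rather than being forked directly from the (large, stateful) parent.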
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126816
Note: Links to docs will display an error until the docs builds have completed. ✅ You can merge normally! (3 unrelated failures.) As of commit ecde0bf with merge base 0ff2f8b. UNSTABLE: the following jobs failed, but were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_inductor/codecache.py (Outdated)

```python
if (
    os.environ.get("TORCH_TNT_IN_USE", "0") == "1"
    or os.environ.get("TORCH_WARM_POOL", "1") != "1"
    # TODO(masnesral): warm pool is broken, need to fix it
```
@jansel, I commandeered this TODO you had originally, but I'm not sure there's anything to actually do here? We're calling `_warm_process_pool()` on the ProcessPoolExecutor owned by the subproc in `SubprocMain.__init__()`. Or did you have some other kind of warmup in mind?
I was getting errors with this turned on. Does it pass for you?
Hmm, ok. It's passing for me, but it looks like something weird is happening: extra subprocesses are being spawned, probably from the subprocess itself.
```python
import sys
import typing

from torch._inductor.codecache import _set_triton_ptxas_path, caching_device_properties
```
`_set_triton_ptxas_path` isn't defined in `torch._inductor.codecache` — do you mean the one defined in compile_tasks.py?
Hmmm; I guess it works because codecache.py re-imports that function. But yeah, will fix.
`python torch/_inductor/compile_worker/__main__.py` doesn't seem to run for me if you just call it directly; not sure if there's some special way to call it so that it passes today.
I'm perplexed. Seems to work fine for me:

```shell
$ python torch/_inductor/compile_worker/__main__.py --help
usage: __main__.py [-h] [--workers WORKERS] [--parent PARENT]

options:
  -h, --help         show this help message and exit
  --workers WORKERS
  --parent PARENT
```

(but making the change anyway)
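For reference, an entry point with the `--help` output shown above can be reproduced with a few lines of argparse. This is a hedged reconstruction, not the PR's actual `__main__.py`: the option names come from the usage text, while the types and defaults are assumptions (`--parent` presumably takes the parent's pid).

```python
import argparse

def parse_args(argv=None):
    # Option names match the usage output above; types/defaults are assumed.
    parser = argparse.ArgumentParser()
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--parent", type=int, default=0)  # parent pid (assumed)
    return parser.parse_args(argv)

args = parse_args(["--workers", "4", "--parent", "1234"])
print(args.workers, args.parent)  # → 4 1234
```

A file containing an `if __name__ == "__main__":` block like this runs the same way whether invoked as `python path/to/__main__.py` or as `python -m package.compile_worker`, which may explain why it worked for one reviewer and not the other only if the imports at the top differ between the two invocation styles.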
```python
    # Wrapper around ProcessPoolExecutor forks in a new process we control
    pool = SubprocPool(config.compile_threads)
else:
    # ensure properties have been calculated before processes
```
nit: assert that `worker_start_method` is one of the other possible start methods
```python
# Examples:
#   A simple x + x + x script: 10ms in the middle of the program, 2ms at startup
#   tf_efficientnet_b0 benchmark: 50ms! in the middle of the program, 3ms at startup
#
# So we want to start the workers early while it is still cheap, and also to allow
# the workers to get ready before we have work for them.
#
# ProcessPoolExecutor also does not launch the workers until it finds a point when
# all the workers are idle. But if we waited until then, fork time would be long
# and we would be waiting for the processes to initialize.
```
Are these comments still relevant?
Lemme add a TODO and I'll benchmark
New benchmark result after properly warming the cache:
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#126816. Approved by: https://github.com/jansel
@pytorchbot revert -m "causing internal regression" -c "ghfirst"

@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 126816 failed. Reason: Command. Details for Dev Infra team: Raised by workflow job.