Conversation

@anijain2305 (Contributor) commented Oct 28, 2024

Stack from ghstack (oldest at bottom):

Motivation

We have spent quite some time this year improving guard performance and soundness. Nevertheless, guards STILL take time. We have seen multiple requests from power users who want almost 0% guard overhead. First, we saw this in vLLM, where even 1% overhead is bad. And recently we saw it in hqq (low-precision LLM generation) - #138386. To put some numbers in perspective, low-precision LLM inference reaches around 250 tokens/second, i.e., each token takes a mere 4 milliseconds. If guard overhead is even 200 us, that is still 5% overhead in total.

Here, users ask - "we can guarantee that there will be no more recompilations in the steady state; give us the lowest guard overhead".

Design

A must-have consideration is to support fast inference where the model has recompiled, i.e., has multiple cache entries for a code object (possibly because of dynamism, or just a tensor dtype change in the case of hqq). So, we still have to run guards to figure out which compiled graph to run.

What we need is the "minimal set of differentiating guards" - i.e., the minimal set of guards that we can run to choose the compiled graph. Note that this works ONLY with the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has been warmed up). If the user violates this assumption and the violation is not covered by the diff guard set, we will choose the wrong compiled graph to run.

When we designed C++ guards, Ed and Voz suggested using a trie structure to directly represent this "diff guard set". But due to complexity, we went for a tree structure and relied on a GuardManager state - "fail_count" - to fail fast. I realized that we can rely on this "fail_count" to find the diff guard set.

If we recompile, that means all of the existing cache entries' guard check_fns have failed. Whenever a guard check_fn fails, we increment a counter in the failing node (and propagate it to the root node) so that we fail faster next time. If we want to run the "diff guard set", we only have to run those nodes in the tree that have "fail_count > 0".
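
As a rough illustration (a hedged sketch, not the actual C++ GuardManager code; the class and field names here are hypothetical), selecting the diff guard set amounts to a tree walk that only visits nodes whose fail_count is positive:

```python
# Hedged sketch of the "diff guard set" idea; the real logic lives in the
# C++ GuardManager tree, and these names are hypothetical.
class GuardNode:
    def __init__(self, check_fn, children=()):
        self.check_fn = check_fn    # leaf guard check, e.g. a TENSOR_MATCH
        self.children = list(children)
        self.fail_count = 0         # bumped (and propagated to the root) on failure

def run_diff_guards(node, f_locals):
    # Nodes that have never failed cannot differentiate between cache entries,
    # so they are skipped entirely.
    if node.fail_count == 0:
        return True
    if not node.check_fn(f_locals):
        return False
    return all(run_diff_guards(child, f_locals) for child in node.children)
```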

This PR relies on this observation to introduce a new stance - "skip_guard_eval". The idea is that the user warms up their model with torch.compile and then runs the steady state with this stance. The stance goes through the existing cache entries for the intercepted code object but only runs the diff guard set. This dramatically reduces the guard overhead. In case all guards fail, we fall back to eager (however, if this happens the user is violating the assumption, so we should perhaps hard error; I need to fix a silly issue with _dynamo.disable to hard error here).
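
For concreteness, the intended usage pattern looks roughly like this (a hedged sketch; whether this ends up spelled as a stance or as the skip_guard_eval_unsafe context manager discussed below, and whether it lives under torch.compiler, is still being settled in this PR; model, warmup_inputs, and steady_inputs are placeholder names):

```python
import torch

compiled_model = torch.compile(model)

# Warm-up phase: trigger every expected recompilation (shapes, dtypes, ...).
for inp in warmup_inputs:
    compiled_model(inp)

# Steady state: only the differentiating guards are evaluated.
with torch.compiler.skip_guard_eval_unsafe():
    for inp in steady_inputs:
        compiled_model(inp)
```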

A bonus is that this "theoretically" works with graph breaks as well, but I need more testing to convince myself of that.

Evaluation

I tried the hqq model in #138386. With very small changes in the user code (see the hqq PR), I see throughput increase from 160 tokens/sec to 174 tokens/sec.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

@pytorch-bot (bot) commented Oct 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139038

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures

As of commit 9aa56c7 with merge base 740054f, the following new jobs have failed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

anijain2305 added a commit that referenced this pull request Oct 28, 2024
ghstack-source-id: cdc1220
Pull Request resolved: #139038
@ezyang (Contributor) commented Oct 28, 2024

I'm positive on this. We should log entry into this stance to tlparse so we can easily tell if someone is in this mode for debugging purposes. I think a hard error on miss is a must. I might suggest a softer version of this that still does dynamic shapes guards (or in general, guards that are "likely" to catch correctness problems).

I'm trying to remember who was most against the sticky cache concept. Maybe @gchanan but he's not available right now. I would show it to that person to get the stiffest feedback.

@anijain2305 (Contributor, Author) replied:

> I might suggest a softer version of this that still does dynamic shapes guards (or in general, guards that are "likely" to catch correctness problems).

Sounds good, this is quite easy to do today. Dynamic shape guards are sitting on the RootGuardManager, so we can always run the RootGuardManager in the "diff guard set".

Comment on lines 169 to 170
elif _stance.stance == "skip_guard_eval":
return torch._C._dynamo.eval_frame.skip_guard_eval_callback_flag
Contributor:

I'm ok with this idea in general, but this seems like a bit of an odd way to do this, since it will also disable compiling of new frames. Maybe this should instead be a global flag in C++ that you can control without changing stances.

We should also put unsafe in the name, and I like the idea of something softer that still checks shapes.

@bhack (Contributor) commented Nov 1, 2024

This is interesting, and I see some similarities to torch.export Dims. I think a common overview of these two concepts could help expose this to users, even though we are talking about two different use cases.

Could an explicit interface like torch.export Dims be clearer to the user for a warm-up analysis?

I think we could adopt something similar to the torch.export logic and API when we want to skip guards, but currently I see two main issues with that API.

@anijain2305 (Contributor, Author) commented Nov 3, 2024

@bhack export is a little bit out of scope here. torch.export is AOT while torch.compile is JIT. This completely changes how you represent and run guards.

@bhack (Contributor) commented Nov 3, 2024

> export is a little bit out of scope here. torch.export is AOT while torch.compile is JIT.

I know this.

> This completely changes how you represent and run guards.

Are we sure about this for the specific skip_guard_eval cases in this thread?

> Note that this works ONLY with the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has been warmed up).

I know this is not exactly like the AOT case, but in this specific case I see many similarities. E.g., we can still have cases where I need to JIT-optimize the code (e.g., running on different hardware instead of the same hardware as with export), and in the warm-up phase I can really expose all the possible inputs I expect to give to the program.

Also, regarding this warm-up phase: we can pass these inputs as real "live" inputs, or we could instead pass a discrete set from a continuous range of dims, right? As mentioned in the previous comment, that is currently not supported if we consider having an API similar to export Dims.

I don't know if this is still so off-topic in this specific case, after this slightly expanded rationale and these use cases.

I am not very deep inside these internals, but I still see a logical connection from the public API surface and from the end-user model development/deployment angle.

@bhack (Contributor) commented Nov 3, 2024

> Note that this works ONLY with the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has been warmed up).

P.S. There are also cases where this could be guaranteed, under the owner's responsibility, by what the program has already historically collected in the remote cache and all of its historically collected dynamic inputs.

@youkaichao (Collaborator):

Just found this and am chiming in here. vLLM finally went with the solution of a custom dispatcher: we have some variables not seen by torch.compile, and we dispatch to compiled cache entries based on those variables.

See https://github.com/vllm-project/vllm/blob/main/vllm/compilation/wrapper.py and https://github.com/vllm-project/vllm/blob/91c9ebbb1bfc39e98aa2bd444b9569e5f2f92c9e/vllm/worker/tpu_model_runner.py#L680-L696

Not sure if this helps.

@anijain2305 changed the title from *[rfc][dynamo] "skip_guard_eval" stance for power users* to *[rfc][dynamo] "skip_guard_eval_unsafe" API for power users* on Nov 4, 2024
anijain2305 added a commit that referenced this pull request Nov 4, 2024
ghstack-source-id: bbfb663
Pull Request resolved: #139038


@contextlib.contextmanager
def skip_guard_eval_unsafe() -> Generator[None, None, None]:
@anijain2305 (Contributor, Author) commented Nov 4, 2024

Bikeshed please.

  1. reduce_overhead_unsafe (probably bad, as users might assume cudagraphs here - anything with the term "overhead" is probably bad)

anijain2305 added a commit that referenced this pull request Nov 4, 2024
ghstack-source-id: 8a7ed25
Pull Request resolved: #139038
@anijain2305 anijain2305 requested a review from jansel November 4, 2024 21:38
@anijain2305 (Contributor, Author):

@jansel This is ready for a real first round of review.



@contextlib.contextmanager
def skip_guard_eval_unsafe() -> Generator[None, None, None]:
Contributor:

@jansel are you sure about not making it a stance? I really don't see why it shouldn't be a stance: you need to hit all the compilations first before you can actually go unsafe, and this seems 100% like what the stance API wants to capture.

Contributor:

We could, but I think it is orthogonal to the existing stances. You could have:

  • skip_guard_eval_unsafe + error_on_recompile
  • skip_guard_eval_unsafe + eager_on_recompile
  • skip_guard_eval_unsafe + default

So this would need to be a different arg.
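
As a hedged sketch of that composition (assuming skip_guard_eval_unsafe lands as a context manager in torch.compiler and that set_stance can be used as a context manager with the stance names above; compiled_model and inp are placeholder names):

```python
import torch

# Hypothetical composition: skip guard evaluation in the steady state while
# still choosing what happens if a recompile would otherwise be needed.
with torch.compiler.set_stance("eager_on_recompile"):
    with torch.compiler.skip_guard_eval_unsafe():
        out = compiled_model(inp)
```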

Contributor:

Yup, sounds good to me! Separate arg.


visit(self.root)

def propagate_diff_guard_set(self):
Contributor:

I am having trouble squaring the function naming with the PR description (I haven't understood the implementation yet). Your PR description suggests that you look at the "fail guard count" to find out the critical guards that are needed to distinguish various cases. But this has nothing to do with "diffing" the guards. For example, suppose I have:

test_a() and test_b()
test_a() and test_c() and test_d()

The "diff" of the guards is "test_b()" and "test_c() and test_d()". However, assuming these are the only two compiled frames, the "critical" guard set is literally "test_b()" only: there is no point in testing test_c/test_d at all; it's unnecessary (under the assumption that these are the only two compiled codes you will ever actually hit).

I'm not sure what level of safety I want out of this API. But if I want the fastest set of tests, there won't be any diffing involved...

@anijain2305 (Contributor, Author) commented Nov 5, 2024

This is for the suggestion on the "softer" version - i.e., run at least the TENSOR_MATCH guards. So, I am forcibly marking all managers that have TENSOR_MATCH guards as include-in-the-diff-guard-set. And then I do a tree traversal to mark all of their parent nodes as include-in-the-diff-guard-set.

If we are going for the absolute minimum diff guard set, you don't need this function.
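
A hedged Python sketch of that traversal (hypothetical names, only to illustrate the tree walk; the actual logic is the GuardManager code in this PR):

```python
def propagate_diff_guard_set(node):
    # Post-order walk: a node ends up in the diff guard set if it was already
    # marked (e.g. forcibly, because it holds a TENSOR_MATCH guard) or if any
    # of its descendants is marked, so every ancestor of a marked node is kept.
    include = getattr(node, "in_diff_guard_set", False)
    for child in node.children:
        include = propagate_diff_guard_set(child) or include
    node.in_diff_guard_set = include
    return include
```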

void* convert_to_root_guard_manager(py::object root);
bool run_root_guard_manager(void* root, PyObject* f_locals);
void set_run_diff_guard_set(void* root);
void unset_run_diff_guard_set(void* root);
Contributor:

docs would help

GuardManager::add_leaf_guard(std::move(leaf_guard));
}

void include_in_diff_guard_set() {
Contributor:

docs for this and other funcs

will be necessary.
If this assumption fails, there is a risk of silently producing incorrect
results (hence the term "unsafe" in the API name).
Contributor:

There is some sort of nontrivial proof of correctness here that relies on the "trie-ness" of guard sets.

In particular, consider this near-miss counterexample. I have two compiled codes with these guards:

a1 and a2
b1

Let us suppose I triggered compilation of these frames with:

a1=True, a2=True, b1=False
a1=False, a2=False, b1=True

These inputs will trigger compilation of both frames. From these two compilations, a1 will have a nonzero fail count.

When we skip guard evaluation, our adjusted guards are now:

a1
True

But this is now unsound EVEN if the user only needs to run the compiled code above. Specifically, if I have a1=True, a2=False, b1=True, the second graph would hit (since b1 == True), but because a1 is True I will use the first compiled code anyway.

Why is this a near-miss counterexample? Assuming guards are never elided, it is impossible for the second frame to have only b1 as its guard. Instead, our guards should be:

a1 and a2
not a1 and b1

In other words, the invariant is that every compiled code must have EXACTLY the same guards, up until a "critical" guard which must be included in both guard sets but then must vary before you're allowed to have different guards. Under these guards, a1=True, a2=False, b1=True isn't covered by either compiled code, and so under the contract here it is OK to drop.

Unfortunately, it's difficult for me to tell if we 100% can rely on this invariant. One easy way to get into trouble is if we sometimes elide guards (which we do!) I also have some vague concern with what happens when we reorder guards but maybe it's OK because you always have to accumulate a failure on the critical guard before anything gets reordered, so the critical guards won't get dropped.
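To make the near-miss concrete, here is a tiny standalone sketch (plain C++, not dynamo code) that evaluates both the full guards and the fail_count-reduced guards on the problematic input; the reduced dispatch happily selects the first compiled code even though only the second one's full guards pass.

#include <cstdio>

int main() {
  // The problematic input: a1=True, a2=False, b1=True.
  bool a1 = true, a2 = false, b1 = true;

  // Full guards: entry 0 needs (a1 and a2), entry 1 needs b1.
  bool full_entry0 = a1 && a2;  // false: entry 0 must not be used
  bool full_entry1 = b1;        // true: entry 1 is the right target

  // Reduced guards: only a1 accumulated a fail_count during warmup, so
  // entry 0 checks just a1 and entry 1 checks nothing at all.
  bool reduced_entry0 = a1;     // true: lookup stops here, incorrectly
  bool reduced_entry1 = true;

  std::printf("full guards:    entry0=%d entry1=%d\n", full_entry0, full_entry1);
  std::printf("reduced guards: entry0=%d entry1=%d\n", reduced_entry0, reduced_entry1);
  return 0;
}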

Contributor Author

@anijain2305 anijain2305 Nov 5, 2024

Yeah, this does not work 😭

I have to think more.

UPDATE - I am convinced that the current state is completely broken. Added a test that fails today - test_cache_line_pickup

extra_state->move_to_front(found);
if (run_diff_guards) {
// Undo the move_front_to_back() done at the beginning of this function.
extra_state->move_front_to_back();
Contributor

How come?

PyObject* maybe_cached_code = NULL;
const char* trace_annotation = "";
lookup(extra, locals, backend, &maybe_cached_code, &trace_annotation);
lookup(extra, locals, backend, &maybe_cached_code, &trace_annotation, PyObject_IsTrue(run_diff_guard_set));
Contributor

Why don't you just store a C++ bool as the state here lol

if in_diff_guard_set:
mgr.include_in_diff_guard_set()

return in_diff_guard_set
Contributor

My previous comments have been about how to handle guards treated as a list of boolean expressions evaluated in order. But we actually have trees. And now I am a little uncertain what exactly it means to be a "critical guard" in a tree setting... like, I don't actually want to do the parent guard tests, right? If for all of my compiled code I can assume foo[0].bar exists, I just shouldn't bother testing whether those things exist when I traverse down to the tensor, right?

// entries, the first cache entry (which would be the latest recompile) does
// not have any guard managers with fail_count > 0. So, we just move it to
// the end.
extra_state->move_front_to_back();
Contributor

@jansel can you remind me why we attempt the most recently compiled cache entry as opposed to the oldest cache entry lol. I guess I can imagine some reasons. Reordering guards seems very dangerous in the unsafe eval world......

Contributor Author

MRU is useful for automatic dynamic

Contributor

The reordering logic is to "fail fast" where we try to check things that fail often first. This is to mitigate the fact that we check multiple cache entries one-by-one.

I don't exactly understand why we are reordering this here though...
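For context on the reordering being discussed, here is a rough sketch of the move-to-front policy (hypothetical CacheEntry and list types, not the real ExtraState code): the cache entry whose guards just passed is spliced to the head of the list so the next lookup for this code object tries it first.

#include <list>

struct CacheEntry {
  int id;
};

// "Fail fast" across cache entries: the entry that just matched is checked
// first on the next lookup.
void move_to_front(std::list<CacheEntry>& entries,
                   std::list<CacheEntry>::iterator found) {
  entries.splice(entries.begin(), entries, found);
}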

Contributor

@jansel jansel left a comment

Thoughts on implementing this as a separate guard manager instance (with fewer guards) versus flags that need to be checked many times?

A separate manager might be faster, since you wouldn't need to check flags multiple times.
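A rough sketch of that alternative, with illustrative types rather than the actual GuardManager API: after warmup, build a second, smaller tree containing only the subtrees with fail_count > 0, and run that tree in the steady state so the hot path never consults per-node flags.

#include <vector>

struct Manager {
  std::vector<Manager> children;
  int fail_count = 0;
};

// Keep only subtrees that have ever failed a guard check; the resulting tree
// can be evaluated directly, with no "should I run this node?" checks.
Manager clone_diff_guard_tree(const Manager& src) {
  Manager reduced;
  reduced.fail_count = src.fail_count;
  for (const Manager& child : src.children) {
    if (child.fail_count > 0) {
      reduced.children.push_back(clone_diff_guard_tree(child));
    }
  }
  return reduced;
}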


// When this flag is turned on, cache lookup runs diff guard set to find the
// compiled graph.
static PyObject* run_diff_guard_set;
Contributor

Should this be just a bool?

PyErr_SetString(PyExc_TypeError, "expected True/False");
return NULL;
}
run_diff_guard_set = value;
Contributor

For True/False this doesn't matter, but this is technically missing refcounts.

}

static PyObject* get_run_diff_guard_set_flag_py(PyObject* dummy, PyObject* args) {
return run_diff_guard_set;
Contributor

Missing incref. This one is actually dangerous because if you run this code enough times the refcount of True/False will reach zero.
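A hedged sketch of the two fixes suggested in this thread (only get_run_diff_guard_set_flag_py appears in the diff; the setter name and the rest of the plumbing here are assumptions): keep the state as a plain C++ bool, and hand it back to Python via Py_RETURN_TRUE / Py_RETURN_FALSE, which incref the singletons for us.

#include <Python.h>

// Plain C++ bool instead of a borrowed PyObject* holding True/False.
static bool run_diff_guard_set_flag = false;

// Hypothetical setter: coerce any object with PyObject_IsTrue.
static PyObject* set_run_diff_guard_set_flag_py(PyObject* dummy, PyObject* value) {
  int truth = PyObject_IsTrue(value);
  if (truth < 0) {
    return NULL;  // propagate the error raised by PyObject_IsTrue
  }
  run_diff_guard_set_flag = (truth == 1);
  Py_RETURN_NONE;
}

static PyObject* get_run_diff_guard_set_flag_py(PyObject* dummy, PyObject* args) {
  // Py_RETURN_TRUE / Py_RETURN_FALSE incref before returning, so the refcount
  // of True/False can no longer drift toward zero.
  if (run_diff_guard_set_flag) {
    Py_RETURN_TRUE;
  }
  Py_RETURN_FALSE;
}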

anijain2305 added a commit that referenced this pull request Nov 5, 2024
We have spent quite some time this year on improving guard performance
and soundness. Nevertheless, guards STILL take time. We have seen
multiple requests/evidences from power users where they want to have
almost 0% guard overhead. First, we saw this in vLLM where even 1%
overhead is bad. And recently we saw this in hqq (low precision LLM
generation) - #138386. To put
some numbers for perspective, low precision LLM inference reaches around
250 tokens/second, i.e., each token takes a mere 4 milliseconds. If guard
overhead is even 200 us, it's still 5% overhead in total.

Here, users ask - "we can guarantee that there will be no more
recompilations in the steady state, give us the lowest guard overhead"

A must-have consideration is to support fast inference where the model
has recompiled (could be because of dynamism, or just tensor dtype
change in the case of hqq). So, we still have to run the guards to
figure out which compiled graph to run.

What we need is the "minimal set of differentiating guards" - i.e.
minimal set of guards that we can run to choose the compiled graph.
Note that this works ONLY with the assumption that users really
guarantee no more recompilation scenarios (no more mutations, no more
dynamism after the model has been warmed up).

When we designed C++ guards, Ed and Voz suggested to use Trie-structure
to directly represent this "diff guard set". But due to complexity, we
went for tree structure and relied on a GuardManager state -
"fail_count" - to fail fast. I realized that we can rely on this
"fail_count" to find the diff guard set.

If we recompile, this means that all the cache line guard eval check_fns
have failed. Whenever a guard check_fn fails, we increment the counter
in the tree node to increase the weightage of that node. If we just want
to run the "guard diff set", we just have to run only those nodes in the
tree which have fail_count > 0.

This PR relies on this observation to introduce a new API -
"skip_guard_eval". The idea is that user will warm up their model with
torch.compile, and then run the steady state with this API. This
API goes through the existing cache lines for the intercepted code
object but only runs the diff guard set. This dramatically reduces the
guard overhead. In case all guards fail, we fall back to eager (however
if this happens then user is violating the assumption, so we should
perhaps hard error).

I tried the hqq model in
#138386. With very small
changes in the user code (share PR), I see the throughput increase from 160
tokens/sec to 174 tokens/sec.

ghstack-source-id: 9332fde
Pull Request resolved: #139038
dt["x"] = x
opt_fn(dt)

def test_cache_line_pickup(self):
Contributor Author

@anijain2305 anijain2305 Nov 5, 2024

As @ezyang pointed out here, the reliance on fail_count is not enough. This test fails today.

For this to really work properly, we probably have to save the source of all failing guards. And then use that to run the diff guard set.

cc @jansel
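A minimal sketch of that follow-up idea, with made-up names (record_guard_failure, in_diff_guard_set) rather than dynamo code: remember the source string of every guard that has ever failed during warmup, and in the steady state evaluate a node only if its source is in that set.

#include <set>
#include <string>

// Sources of guards that have failed at least once during warmup.
static std::set<std::string> failing_guard_sources;

void record_guard_failure(const std::string& source) {
  failing_guard_sources.insert(source);
}

// Steady state: run a guard only if it has been observed failing before;
// everything else is assumed stable under the user's no-recompile contract.
bool in_diff_guard_set(const std::string& source) {
  return failing_guard_sources.count(source) > 0;
}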

@anijain2305
Contributor Author

Moving to #140251

The main reason for closing this is the implementation. In the other stack, I am introducing the facility to clone the guard manager, which makes it more readable and performant.

@github-actions github-actions bot deleted the gh/anijain2305/564/head branch December 12, 2024 02:11

Labels

ciflow/inductor, ciflow/trunk, module: dynamo, topic: not user facing

6 participants