Conversation

@anijain2305 (Contributor) commented Oct 28, 2024

Stack from ghstack (oldest at bottom):

Motivation

We have spent quite some time this year improving guard performance and soundness. Nevertheless, guards STILL take time. We have seen multiple requests from power users who want almost 0% guard overhead. First, we saw this in vLLM, where even 1% overhead is bad. And recently we saw it in hqq (low-precision LLM generation) - #138386. To put some numbers in perspective, low-precision LLM inference reaches around 250 tokens/second, i.e., each token takes a mere 4 milliseconds. If guard overhead is even 200 us, that is still 5% overhead in total.

Here, users ask - "we can guarantee that there will be no more recompilations in the steady state; give us the lowest guard overhead".

Design

A must-have consideration is to support fast inference where the model has recompiled, i.e., has multiple cache entries for a code object (possibly because of dynamism, or just a tensor dtype change in the case of hqq). So, we still have to run guards to figure out which compiled graph to run.

What we need is the "minimal set of differentiating guards" - i.e., the minimal set of guards that we can run to choose the compiled graph. Note that this works ONLY with the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has been warmed up). If the user violates this assumption and the violation is not covered by the diff guard set, we will choose the wrong compiled graph to run.

When we designed C++ guards, Ed and Voz suggested using a trie structure to directly represent this "diff guard set". But due to complexity, we went for a tree structure and relied on a GuardManager state - "fail_count" - to fail fast. I realized that we can rely on this "fail_count" to find the diff guard set.

If we recompile, that means all of the existing cache entries' guard check_fns have failed. Whenever a guard check_fn fails, we increment a counter in the failing node (and propagate it to the root node) so that we fail faster next time. If we want to run the "diff guard set", we only have to run those nodes in the tree that have "fail_count > 0".
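
As a rough illustration (a hedged sketch, not the actual C++ GuardManager code; the class and field names here are hypothetical), selecting the diff guard set amounts to a tree walk that only visits nodes whose fail_count is positive:

```python
# Hedged sketch of the "diff guard set" idea; the real logic lives in the
# C++ GuardManager tree, and these names are hypothetical.
class GuardNode:
    def __init__(self, check_fn, children=()):
        self.check_fn = check_fn    # leaf guard check, e.g. a TENSOR_MATCH
        self.children = list(children)
        self.fail_count = 0         # bumped (and propagated to the root) on failure

def run_diff_guards(node, f_locals):
    # Nodes that have never failed cannot differentiate between cache entries,
    # so they are skipped entirely.
    if node.fail_count == 0:
        return True
    if not node.check_fn(f_locals):
        return False
    return all(run_diff_guards(child, f_locals) for child in node.children)
```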

This PR relies on this observation to introduce a new stance - "skip_guard_eval". The idea is that the user warms up their model with torch.compile and then runs the steady state with this stance. The stance goes through the existing cache entries for the intercepted code object but only runs the diff guard set. This dramatically reduces the guard overhead. In case all guards fail, we fall back to eager (however, if this happens the user is violating the assumption, so we should perhaps hard error; I need to fix a silly issue with _dynamo.disable to hard error here).
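
For concreteness, the intended usage pattern looks roughly like this (a hedged sketch; whether this ends up spelled as a stance or as the skip_guard_eval_unsafe context manager discussed below, and whether it lives under torch.compiler, is still being settled in this PR; model, warmup_inputs, and steady_inputs are placeholder names):

```python
import torch

compiled_model = torch.compile(model)

# Warm-up phase: trigger every expected recompilation (shapes, dtypes, ...).
for inp in warmup_inputs:
    compiled_model(inp)

# Steady state: only the differentiating guards are evaluated.
with torch.compiler.skip_guard_eval_unsafe():
    for inp in steady_inputs:
        compiled_model(inp)
```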

A bonus is that this "theoretically" works with graph breaks as well, but I need more testing to convince myself of that.

Evaluation

I tried the hqq model in #138386. With very small changes in the user code (see the hqq PR), I see throughput increase from 160 tokens/sec to 174 tokens/sec.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec

@pytorch-bot (bot) commented Oct 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139038

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures

As of commit 9aa56c7 with merge base 740054f, the following new jobs have failed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

anijain2305 added a commit that referenced this pull request Oct 28, 2024
ghstack-source-id: cdc1220
Pull Request resolved: #139038
@ezyang (Contributor) commented Oct 28, 2024

I'm positive on this. We should log entry into this stance to tlparse so we can easily tell if someone is in this mode for debugging purposes. I think a hard error on miss is a must. I might suggest a softer version of this that still does dynamic shapes guards (or in general, guards that are "likely" to catch correctness problems).

I'm trying to remember who was most against the sticky cache concept. Maybe @gchanan but he's not available right now. I would show it to that person to get the stiffest feedback.

@anijain2305 (Contributor, Author) replied:

> I might suggest a softer version of this that still does dynamic shapes guards (or in general, guards that are "likely" to catch correctness problems).

Sounds good, this is quite easy to do today. Dynamic shape guards are sitting on the RootGuardManager, so we can always run the RootGuardManager in the "diff guard set".

Comment on lines 169 to 170
elif _stance.stance == "skip_guard_eval":
return torch._C._dynamo.eval_frame.skip_guard_eval_callback_flag
Contributor:

I'm ok with this idea in general, but this seems like a bit of an odd way to do this, since it will also disable compiling of new frames. Maybe this should instead be a global flag in C++ that you can control without changing stances.

We should also put unsafe in the name, and I like the idea of something softer that still checks shapes.

@bhack (Contributor) commented Nov 1, 2024

This is interesting, and I see some similarities to torch.export Dims. I think a common overview of these two concepts could help expose this to users, even though we are talking about two different use cases.

Could an explicit interface like torch.export Dims be clearer to the user for a warm-up analysis?

I think we could adopt something similar to the torch.export logic and API when we want to skip guards, but currently I see two main issues with that API.

@anijain2305 (Contributor, Author) commented Nov 3, 2024

@bhack export is a little bit out of scope here. torch.export is AOT while torch.compile is JIT. This completely changes how you represent and run guards.

@bhack (Contributor) commented Nov 3, 2024

> export is a little bit out of scope here. torch.export is AOT while torch.compile is JIT.

I know this.

> This completely changes how you represent and run guards.

Are we sure about this for the specific skip_guard_eval cases in this thread?

> Note that this works ONLY with the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has been warmed up).

I know this is not exactly like the AOT case, but in this specific case I see many similarities. E.g., we can still have cases where I need to JIT-optimize the code (e.g., running on different hardware instead of the same hardware as with export), and in the warm-up phase I can really expose all the possible inputs I expect to give to the program.

Also, regarding this warm-up phase: we can pass these inputs as real "live" inputs, or we could instead pass a discrete set from a continuous range of dims, right? As mentioned in the previous comment, that is currently not supported if we consider having an API similar to export Dims.

I don't know if this is still so off-topic in this specific case, after this slightly expanded rationale and these use cases.

I am not very deep inside these internals, but I still see a logical connection from the public API surface and from the end-user model development/deployment angle.

@bhack (Contributor) commented Nov 3, 2024

> Note that this works ONLY with the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has been warmed up).

P.S. There are also cases where this could be guaranteed, under the owner's responsibility, by what the program has already historically collected in the remote cache and all of its historically collected dynamic inputs.

@youkaichao (Collaborator):

Just found this and am chiming in here. vLLM finally went with the solution of a custom dispatcher: we have some variables not seen by torch.compile, and we dispatch to compiled cache entries based on those variables.

See https://github.com/vllm-project/vllm/blob/main/vllm/compilation/wrapper.py and https://github.com/vllm-project/vllm/blob/91c9ebbb1bfc39e98aa2bd444b9569e5f2f92c9e/vllm/worker/tpu_model_runner.py#L680-L696

Not sure if this helps.

@anijain2305 changed the title from *[rfc][dynamo] "skip_guard_eval" stance for power users* to *[rfc][dynamo] "skip_guard_eval_unsafe" API for power users* on Nov 4, 2024
anijain2305 added a commit that referenced this pull request Nov 4, 2024
ghstack-source-id: bbfb663
Pull Request resolved: #139038


@contextlib.contextmanager
def skip_guard_eval_unsafe() -> Generator[None, None, None]:
@anijain2305 (Contributor, Author) commented Nov 4, 2024

Bikeshed please.

  1. reduce_overhead_unsafe (probably bad, as users might assume cudagraphs here - anything with the term "overhead" is probably bad)

anijain2305 added a commit that referenced this pull request Nov 4, 2024
ghstack-source-id: 8a7ed25
Pull Request resolved: #139038
@anijain2305 anijain2305 requested a review from jansel November 4, 2024 21:38
@anijain2305 (Contributor, Author):

@jansel This is ready for a real first round of review.



@contextlib.contextmanager
def skip_guard_eval_unsafe() -> Generator[None, None, None]:
Contributor:

@jansel are you sure about not making it a stance? I really don't see why it shouldn't be a stance: you need to hit all the compilations first before you can actually go unsafe, and this seems 100% like what the stance API wants to capture.

Contributor:

We could, but I think it is orthogonal to the existing stances. You could have:

  • skip_guard_eval_unsafe + error_on_recompile
  • skip_guard_eval_unsafe + eager_on_recompile
  • skip_guard_eval_unsafe + default

So this would need to be a different arg.
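
As a hedged sketch of that composition (assuming skip_guard_eval_unsafe lands as a context manager in torch.compiler and that set_stance can be used as a context manager with the stance names above; compiled_model and inp are placeholder names):

```python
import torch

# Hypothetical composition: skip guard evaluation in the steady state while
# still choosing what happens if a recompile would otherwise be needed.
with torch.compiler.set_stance("eager_on_recompile"):
    with torch.compiler.skip_guard_eval_unsafe():
        out = compiled_model(inp)
```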

Contributor:

Yup, sounds good to me! Separate arg.


visit(self.root)

def propagate_diff_guard_set(self):
Contributor:

I am having trouble squaring the function naming with the PR description (I haven't understood the implementation yet). Your PR description suggests that you look at the "fail guard count" to find out the critical guards that are needed to distinguish various cases. But this has nothing to do with "diffing" the guards. For example, suppose I have:

test_a() and test_b()
test_a() and test_c() and test_d()

The "diff" of the guards is "test_b()" and "test_c() and test_d()". However, assuming these are the only two compiled frames, the "critical" guard set is literally "test_b()" only: there is no point in testing test_c/test_d at all; it's unnecessary (under the assumption that these are the only two compiled codes you will ever actually hit).

I'm not sure what level of safety I want out of this API. But if I want the fastest set of tests, there won't be any diffing involved...

@anijain2305 (Contributor, Author) commented Nov 5, 2024

This is for the suggestion on the "softer" version - i.e., run at least the TENSOR_MATCH guards. So, I am forcibly marking all managers that have TENSOR_MATCH guards as include-in-the-diff-guard-set. And then I do a tree traversal to mark all of their parent nodes as include-in-the-diff-guard-set.

If we are going for the absolute minimum diff guard set, you don't need this function.
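
A hedged Python sketch of that traversal (hypothetical names, only to illustrate the tree walk; the actual logic is the GuardManager code in this PR):

```python
def propagate_diff_guard_set(node):
    # Post-order walk: a node ends up in the diff guard set if it was already
    # marked (e.g. forcibly, because it holds a TENSOR_MATCH guard) or if any
    # of its descendants is marked, so every ancestor of a marked node is kept.
    include = getattr(node, "in_diff_guard_set", False)
    for child in node.children:
        include = propagate_diff_guard_set(child) or include
    node.in_diff_guard_set = include
    return include
```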

void* convert_to_root_guard_manager(py::object root);
bool run_root_guard_manager(void* root, PyObject* f_locals);
void set_run_diff_guard_set(void* root);
void unset_run_diff_guard_set(void* root);
Contributor:

docs would help

GuardManager::add_leaf_guard(std::move(leaf_guard));
}

void include_in_diff_guard_set() {
Contributor:

docs for this and other funcs

will be necessary.
If this assumption fails, there is a risk of silently producing incorrect
results (hence the term "unsafe" in the API name).
Contributor:

There is some sort of nontrivial proof of correctness here that relies on the "trie-ness" of guard sets.

In particular, consider this near-miss counterexample. I have two compiled codes with these guards:

a1 and a2
b1

Let us suppose I triggered compilation of these frames with:

a1=True, a2=True, b1=False
a1=False, a2=False, b1=True

These inputs will trigger compilation of both frames. From these two compilations, a1 will have a nonzero fail count.

When we skip guard evaluation, our adjusted guards are now:

a1
True

But this is now unsound EVEN if the user only needs to run the compiled code above. Specifically, if I have a1=True, a2=False, b1=True, the second graph would hit (since b1 == True), but because a1 is True I will use the first compiled code anyway.

Why is this a near-miss counterexample? Assuming guards are never elided, it is impossible for the second frame to have only b1 as its guard. Instead, our guards should be:

a1 and a2
not a1 and b1

In other words, the invariant is that every compiled code must have EXACTLY the same guards, up until a "critical" guard which must be included in both guard sets but then must vary before you're allowed to have different guards. Under these guards, a1=True, a2=False, b1=True isn't covered by either compiled code, and so under the contract here it is OK to drop.

Unfortunately, it's difficult for me to tell if we 100% can rely on this invariant. One easy way to get into trouble is if we sometimes elide guards (which we do!) I also have some vague concern with what happens when we reorder guards but maybe it's OK because you always have to accumulate a failure on the critical guard before anything gets reordered, so the critical guards won't get dropped.
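To make the near-miss concrete, here is a tiny standalone sketch (plain C++, not dynamo code) that evaluates both the full guards and the fail_count-reduced guards on the problematic input; the reduced dispatch happily selects the first compiled code even though only the second one's full guards pass.

#include <cstdio>

int main() {
  // The problematic input: a1=True, a2=False, b1=True.
  bool a1 = true, a2 = false, b1 = true;

  // Full guards: entry 0 needs (a1 and a2), entry 1 needs b1.
  bool full_entry0 = a1 && a2;  // false: entry 0 must not be used
  bool full_entry1 = b1;        // true: entry 1 is the right target

  // Reduced guards: only a1 accumulated a fail_count during warmup, so
  // entry 0 checks just a1 and entry 1 checks nothing at all.
  bool reduced_entry0 = a1;     // true: lookup stops here, incorrectly
  bool reduced_entry1 = true;

  std::printf("full guards:    entry0=%d entry1=%d\n", full_entry0, full_entry1);
  std::printf("reduced guards: entry0=%d entry1=%d\n", reduced_entry0, reduced_entry1);
  return 0;
}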

Contributor Author

@anijain2305 anijain2305 Nov 5, 2024

Yeah, this does not work 😭

I have to think more.

UPDATE - I am convinced that the current state is completely broken. Added a test that fails today - test_cache_line_pickup

extra_state->move_to_front(found);
if (run_diff_guards) {
// Undo the move_front_to_back() done at the beginning of this function.
extra_state->move_front_to_back();
Contributor

How come?

PyObject* maybe_cached_code = NULL;
const char* trace_annotation = "";
lookup(extra, locals, backend, &maybe_cached_code, &trace_annotation);
lookup(extra, locals, backend, &maybe_cached_code, &trace_annotation, PyObject_IsTrue(run_diff_guard_set));
Contributor

Why don't you just store a C++ bool as the state here lol

if in_diff_guard_set:
mgr.include_in_diff_guard_set()

return in_diff_guard_set
Contributor

My previous comments have been about how to handle guards treated as a list of boolean expressions evaluated in order. But we actually have trees. And now I am a little uncertain what exactly it means to be a "critical guard" in a tree setting... like, I don't actually want to do the parent guard tests, right? If for all of my compiled code I can assume foo[0].bar exists, I just shouldn't bother testing whether those things exist when I traverse down to the tensor, right?

// entries, the first cache entry (which would be the latest recompile) does
// not have any guard managers with fail_count > 0. So, we just move it to
// the end.
extra_state->move_front_to_back();
Contributor

@jansel can you remind me why we attempt the most recently compiled cache entry as opposed to the oldest cache entry lol. I guess I can imagine some reasons. Reordering guards seems very dangerous in the unsafe eval world......

Contributor Author

MRU is useful for automatic dynamic

Contributor

The reordering logic is to "fail fast" where we try to check things that fail often first. This is to mitigate the fact that we check multiple cache entries one-by-one.

I don't exactly understand why we are reordering this here though...
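For context on the reordering being discussed, here is a rough sketch of the move-to-front policy (hypothetical CacheEntry and list types, not the real ExtraState code): the cache entry whose guards just passed is spliced to the head of the list so the next lookup for this code object tries it first.

#include <list>

struct CacheEntry {
  int id;
};

// "Fail fast" across cache entries: the entry that just matched is checked
// first on the next lookup.
void move_to_front(std::list<CacheEntry>& entries,
                   std::list<CacheEntry>::iterator found) {
  entries.splice(entries.begin(), entries, found);
}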

Contributor

@jansel jansel left a comment

Thoughts on implementing this as a separate guard manager instance (with fewer guards) versus flags that need to be checked many times?

A separate manager might be faster, since you wouldn't need to check flags multiple times.
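A rough sketch of that alternative, with illustrative types rather than the actual GuardManager API: after warmup, build a second, smaller tree containing only the subtrees with fail_count > 0, and run that tree in the steady state so the hot path never consults per-node flags.

#include <vector>

struct Manager {
  std::vector<Manager> children;
  int fail_count = 0;
};

// Keep only subtrees that have ever failed a guard check; the resulting tree
// can be evaluated directly, with no "should I run this node?" checks.
Manager clone_diff_guard_tree(const Manager& src) {
  Manager reduced;
  reduced.fail_count = src.fail_count;
  for (const Manager& child : src.children) {
    if (child.fail_count > 0) {
      reduced.children.push_back(clone_diff_guard_tree(child));
    }
  }
  return reduced;
}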


// When this flag is turned on, cache lookup runs diff guard set to find the
// compiled graph.
static PyObject* run_diff_guard_set;
Contributor

Should this be just a bool?

PyErr_SetString(PyExc_TypeError, "expected True/False");
return NULL;
}
run_diff_guard_set = value;
Contributor

For True/False this doesn't matter, but this is technically missing refcounts.

}

static PyObject* get_run_diff_guard_set_flag_py(PyObject* dummy, PyObject* args) {
return run_diff_guard_set;
Contributor

Missing incref. This one is actually dangerous because if you run this code enough times the refcount of True/False will reach zero.
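A hedged sketch of the two fixes suggested in this thread (only get_run_diff_guard_set_flag_py appears in the diff; the setter name and the rest of the plumbing here are assumptions): keep the state as a plain C++ bool, and hand it back to Python via Py_RETURN_TRUE / Py_RETURN_FALSE, which incref the singletons for us.

#include <Python.h>

// Plain C++ bool instead of a borrowed PyObject* holding True/False.
static bool run_diff_guard_set_flag = false;

// Hypothetical setter: coerce any object with PyObject_IsTrue.
static PyObject* set_run_diff_guard_set_flag_py(PyObject* dummy, PyObject* value) {
  int truth = PyObject_IsTrue(value);
  if (truth < 0) {
    return NULL;  // propagate the error raised by PyObject_IsTrue
  }
  run_diff_guard_set_flag = (truth == 1);
  Py_RETURN_NONE;
}

static PyObject* get_run_diff_guard_set_flag_py(PyObject* dummy, PyObject* args) {
  // Py_RETURN_TRUE / Py_RETURN_FALSE incref before returning, so the refcount
  // of True/False can no longer drift toward zero.
  if (run_diff_guard_set_flag) {
    Py_RETURN_TRUE;
  }
  Py_RETURN_FALSE;
}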

anijain2305 added a commit that referenced this pull request Nov 5, 2024
We have spent quite some time this year on improving guard performance
and soundness. Nevertheless, guards STILL take time. We have seen
multiple requests/evidences from power users where they want to have
almost 0% guard overhead. First, we saw this in vLLM where even 1%
overhead is bad. And recently we saw this in hqq (low precision LLM
generation) - #138386. To put
some numbers for perspective, low precision LLM inference reaches around
250 tokens/second, i.e., each token takes a mere 4 milliseconds. If guard
overhead is even 200 us, it's still 5% overhead in total.

Here, users ask - "we can guarantee that there will be no more
recompilations in the steady state, give us the lowest guard overhead"

A must-have consideration is to support fast inference where the model
has recompiled (could be because of dynamism, or just tensor dtype
change in the case of hqq). So, we still have to run the guards to
figure out which compiled graph to run.

What we need is the "minimal set of differentiating guards" - i.e.
minimal set of guards that we can run to choose the compiled graph.
Note that this works ONLY with the assumption that users really
guarantee no more recompilation scenarios (no more mutations, no more
dynamism after the model has been warmed up).

When we designed C++ guards, Ed and Voz suggested to use Trie-structure
to directly represent this "diff guard set". But due to complexity, we
went for tree structure and relied on a GuardManager state -
"fail_count" - to fail fast. I realized that we can rely on this
"fail_count" to find the diff guard set.

If we recompile, this means that all the cache line guard eval check_fns
have failed. Whenever a guard check_fn fails, we increment the counter
in the tree node to increase the weightage of that node. If we just want
to run the "guard diff set", we just have to run only those nodes in the
tree which have fail_count > 0.

This PR relies on this observation to introduce a new API -
"skip_guard_eval". The idea is that user will warm up their model with
torch.compile, and then run the steady state with this API. This
API goes through the existing cache lines for the intercepted code
object but only runs the diff guard set. This dramatically reduces the
guard overhead. In case all guards fail, we fall back to eager (however
if this happens then user is violating the assumption, so we should
perhaps hard error).

I tried the hqq model in
#138386. With very small
changes in the user code (share PR), I see the throughput increase from 160
tokens/sec to 174 tokens/sec.

ghstack-source-id: 9332fde
Pull Request resolved: #139038
dt["x"] = x
opt_fn(dt)

def test_cache_line_pickup(self):
Contributor Author

@anijain2305 anijain2305 Nov 5, 2024

As @ezyang pointed out here, the reliance on fail_count is not enough. This test fails today.

For this to really work properly, we probably have to save the source of all failing guards. And then use that to run the diff guard set.

cc @jansel
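A minimal sketch of that follow-up idea, with made-up names (record_guard_failure, in_diff_guard_set) rather than dynamo code: remember the source string of every guard that has ever failed during warmup, and in the steady state evaluate a node only if its source is in that set.

#include <set>
#include <string>

// Sources of guards that have failed at least once during warmup.
static std::set<std::string> failing_guard_sources;

void record_guard_failure(const std::string& source) {
  failing_guard_sources.insert(source);
}

// Steady state: run a guard only if it has been observed failing before;
// everything else is assumed stable under the user's no-recompile contract.
bool in_diff_guard_set(const std::string& source) {
  return failing_guard_sources.count(source) > 0;
}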

@anijain2305
Contributor Author

Moving to #140251

The main reason for closing this is the implementation. In the other stack, I am introducing the facility to clone the guard manager, which makes it more readable and performant.

@github-actions github-actions bot deleted the gh/anijain2305/564/head branch December 12, 2024 02:11

Labels

ciflow/inductor, ciflow/trunk, module: dynamo, topic: not user facing

6 participants