Optimize dynamo guards for global hooks dicts#96246
wconstab wants to merge 1 commit into gh/wconstab/127/base
Conversation
Use the 'BOOL_FALSE' guard (which codegens to `not value`) in place of the 'DICT_KEYS' guard (which codegens a call to a slower type-check function) when guarding on global nn.Module hooks dicts being empty. Note: it's definitely not foolproof to use BOOL_FALSE and skip the type check, since some runtime types can fail to execute `bool(runtime_obj)`, e.g. if the object is a tensor. We are opting hooks dicts into special treatment on the assumption of well-behaved users.
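The difference between the two guard forms can be sketched as follows; the helper `check_type_id` is a stand-in for Dynamo's internal type-check helper, not the real implementation:

```python
# Illustrative sketch of the two guard forms discussed in this PR.
# check_type_id is a stand-in for Dynamo's internal exact-type check.

def check_type_id(obj, expected_type_id):
    # Exact-type check by comparing the id() of the object's type.
    return id(type(obj)) == expected_type_id

hooks = {}  # e.g. one of the global hooks dicts on nn.Module

# DICT_KEYS-style guard: a type check plus a keys comparison.
dict_keys_guard = check_type_id(hooks, id(dict)) and set(hooks.keys()) == set()

# BOOL_FALSE-style guard: codegens to just `not value`.
bool_false_guard = not hooks

print(dict_keys_guard, bool_false_guard)  # True True for an empty dict

# The caveat from the description: `not value` assumes bool(value) is well
# defined. A multi-element torch.Tensor raises on bool(), which is why
# skipping the type check is only safe for trusted ("well behaved") objects.
```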
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/96246
✅ No failures as of commit e6b51d6. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Looks like I can't run the perf CI currently as it's being revamped, but here is a local run on one model that seemed particularly sensitive to guards overhead in my previous sweeps (MobileBertForQuestionAnswering):
Digging in further, I realize I missed something in my previous explanation of why my optimization made things faster. When the `_has_hooks` bool exists, we guard on its literal value matching Python's singleton `False` object. In other words, with my attempt to run a faster 'BOOL_FALSE' guard, I go from checking the length and type of a dict to calling the …
For context, in MobileBertForQuestionAnswering there are 651 separate nn.Modules in the model, requiring …
So the speedup from the 'unsound' nnmodule PR is multiplicative: 4x * 3x ≈ 12x (the 3x coming from 0.57 / 0.18), but I don't think there is a way to do this optimization without changing the nn.Module side. Unless I am mistaken, this only leaves …
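The guard strategies being compared here can be approximated with a rough micro-benchmark; the lambda bodies are sketches of what the generated guard code does (not Dynamo's actual codegen), and the absolute numbers are machine-dependent:

```python
import timeit

d = {}             # stand-in for an empty hooks dict
has_hooks = False  # stand-in for a cached _has_hooks bool
false_id = id(False)

# ID guard on a cached bool: a pointer comparison against the False singleton.
t_id = timeit.timeit(lambda: id(has_hooks) == false_id, number=1_000_000)

# BOOL_FALSE guard on the dict itself: `not d` calls bool(d) under the hood.
t_bool = timeit.timeit(lambda: not d, number=1_000_000)

# DICT_KEYS-style guard: an exact type check plus a keys comparison.
t_keys = timeit.timeit(
    lambda: type(d) is dict and set(d.keys()) == set(),
    number=1_000_000,
)

print(f"id check:   {t_id:.3f}s")
print(f"bool check: {t_bool:.3f}s")
print(f"keys check: {t_keys:.3f}s")
```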
What about moving nn.Module guards to C++?
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
Note to reviewers: my previous attempt to use BOOL_FALSE only applied to per-nn.module hooks. I am guessing that this attempt will not yield significant gains since it only affects the 4 global dicts and not the N copies of per-module dicts, but it could be that the runtime of check_type_id is significant enough to matter even there.
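To make the scaling concern concrete, here is a back-of-the-envelope guard count. The 651 modules and 4 global dicts come from this thread; the per-module hooks-dict count of 3 is an assumption for illustration only:

```python
# Back-of-the-envelope guard accounting for MobileBertForQuestionAnswering.
num_modules = 651          # separate nn.Modules reported in this thread
global_hook_dicts = 4      # global hooks dicts this PR optimizes
per_module_hook_dicts = 3  # assumed per-module dicts, e.g. _forward_hooks,
                           # _forward_pre_hooks, _backward_hooks

per_module_guards = num_modules * per_module_hook_dicts
share = global_hook_dicts / (per_module_guards + global_hook_dicts)
print(per_module_guards)  # 1953 per-module hooks-dict guards
print(f"{share:.2%}")     # 0.20% -- the global dicts are a tiny fraction
```

Under these assumptions the 4 global-dict guards are a fraction of a percent of the hooks-related guards, consistent with the expectation that this PR alone yields little end-to-end gain.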
cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire