[POC] automatic_dynamic_local_pgo #138052
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138052

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 114 Pending. As of commit 9835513 (merge base could not be retrieved; please contact dev infra). NEW FAILURE: the following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Please commit the suggested changes from pytorch's linter.
Signed-off-by: Edward Z. Yang <[email protected]>
ghstack-source-id: a300afb
Pull Request resolved: #138052
```python
# TODO: this scheme makes manual inspection of cache entries difficult,
# consider adding some breadcrumbs in the name for ease of use
r = os.path.join(
    cache_dir(), "dynamo", sha256_hash(pickle.dumps((filename, firstlineno, name)))
)
```
pickle.dumps is slow, I'd suggest f"{filename},{firstlineno},{name}".encode("utf-8")
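A minimal sketch contrasting the two key schemes under discussion. The function names are illustrative, and `sha256_hash` from the PR is assumed to behave like a plain SHA-256 hex digest here:

```python
import hashlib
import pickle

def cache_key_pickle(filename, firstlineno, name):
    # Current scheme in the PR: hash a pickled tuple (opaque, and
    # pickling adds serialization overhead on every lookup).
    return hashlib.sha256(pickle.dumps((filename, firstlineno, name))).hexdigest()

def cache_key_plain(filename, firstlineno, name):
    # Reviewer's suggestion: hash a plain encoded string, skipping pickle.
    return hashlib.sha256(
        f"{filename},{firstlineno},{name}".encode("utf-8")
    ).hexdigest()
```

Both schemes are deterministic and collision-resistant for this key shape; the plain-string variant is also human-readable before hashing, which helps with the breadcrumbs TODO above.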
```python
try:
    r = pickle.load(f)
except Exception:
    log.warning("get_code_object_cache failed while reading %s", path)
```
include error?
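A hedged sketch of what the reviewer's suggestion could look like: capture the exception and include it in the warning instead of logging only the path. The function name and signature are illustrative, not the PR's actual code:

```python
import logging
import pickle

log = logging.getLogger(__name__)

def load_code_object_cache(path):
    # Catching broadly is deliberate here: a corrupt or truncated cache
    # entry should degrade to a cache miss, not crash compilation.
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except Exception as e:
        # Include the error so failed reads are diagnosable from logs.
        log.warning("get_code_object_cache failed while reading %s: %s", path, e)
        return None
```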
```python
# have multiple distinct code objects that have identical filename/line
# number, we will share automatic dynamic information across them (TODO:
# change default automatic dynamic behavior so it also crosstalks in this way)
automatic_dynamic_local_pgo = False
```
NB: this isn't wired up yet
Previously: #138052 but the implementation is done from scratch, so I open a new PR.

Signed-off-by: Edward Z. Yang <[email protected]>
ghstack-source-id: e1f0761
Pull Request resolved: #139001
Obsoleted by #139001
While working on automatic dynamic PGO (#138052), one abstract property I was looking for out of profile information is that it forms a semilattice: I can join together two profiles and get a merged profile that is consistent with the profiles I saw in both cases. While working on a data structure that supports joins, I realized that the base automatic dynamic algorithm could be implemented in this way, hence this refactor.

The basic recipe is that we now support a join operation on FrameStateSizeEntry. Intuitively, if you join two sizes that are equal, you get back that size (join(2, 2) == 2), but if you join two different sizes you get a special singleton auto_dynamic indicating that the size of the tensor is dynamic (join(2, 3) == auto_dynamic). So now the automatic dynamic algorithm is: (1) compute the FrameStateSizeEntry that corresponds to the concrete values we've seen, and (2) join it into the ambient FrameStateSizeEntry. As a bonus, compiler collectives can buy into the same abstraction (we simply distribute FrameStateSizeEntry from each node to every other node). For convenience, I also added the necessary `auto_unset` extra state, which is the identity element (making our semilattice bounded from both top and bottom). Here, join(2, auto_unset) == 2.

While doing this, there was a complication: the infer-stride algorithm wasn't technically a semilattice. Here, I did what I suggested in the original code review #130232, which is to stop using a heuristic and instead replicate the stride inference algorithm in automatic dynamic. This means that when I join strides together, I don't join their concrete values; instead, if a stride can be inferred as the contiguous stride for a particular inner dimension, it is represented as InferStride(dim). There's an example in the code which I recommend looking at.
Some other extra things happening in this PR:

* I tried to deduplicate the size/stride automatic dynamic logic as much as possible, so hopefully there is less code to review here.
* I had to reimplement all the logging. For the most part I tried to track the original logging as closely as possible, but I think we could be emitting fewer Chrome events here.
* The `marked_dynamic` handling is still preserved as is, but I don't really like it and we should figure out how to put it somewhere else.

Signed-off-by: Edward Z. Yang <[email protected]>
Pull Request resolved: #138717
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693
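The join described above can be sketched in a few lines. This is a deliberately minimal model: the real FrameStateSizeEntry in PyTorch carries more state (per-dimension sizes, InferStride), so the names and scalar-only signature here are illustrative:

```python
# Sentinels for the bounded semilattice: AUTO_UNSET is the identity
# (nothing observed yet), AUTO_DYNAMIC is the top (observations disagreed).
AUTO_UNSET = object()
AUTO_DYNAMIC = object()

def join_size(a, b):
    # join(x, AUTO_UNSET) == x;  join(x, x) == x;  join(x, y) == AUTO_DYNAMIC
    # for x != y. Joining is commutative, associative, and idempotent, which
    # is what lets profiles from different runs (or nodes) be merged in any
    # order with a consistent result.
    if a is AUTO_UNSET:
        return b
    if b is AUTO_UNSET:
        return a
    if a is AUTO_DYNAMIC or b is AUTO_DYNAMIC:
        return AUTO_DYNAMIC
    return a if a == b else AUTO_DYNAMIC
```

With this structure, the per-frame algorithm reduces to "lift the observed concrete size into the lattice, then join it into the stored entry", and compiler collectives are just a join across nodes.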
Previously: #138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs, but generic OSS users will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <[email protected]>
Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: #139001
Approved by: https://github.com/oulgen
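A hypothetical sketch of the opt-in gating described above: nothing is saved or loaded unless a job id is supplied. The function names, profile format, and path layout are all assumptions for illustration, not PyTorch's actual API:

```python
import os
import pickle

def _profile_path(cache_dir, job_id):
    # Keying the file by job id means profiles are only ever shared among
    # runs that explicitly claimed the same id.
    return os.path.join(cache_dir, "dynamo", f"pgo-{job_id}.pkl")

def save_profile(cache_dir, job_id, profile):
    if job_id is None:  # cache is never enabled by default
        return False
    path = _profile_path(cache_dir, job_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(profile, f)
    return True

def load_profile(cache_dir, job_id):
    if job_id is None:
        return None
    try:
        with open(_profile_path(cache_dir, job_id), "rb") as f:
            return pickle.load(f)
    except OSError:  # missing profile is just a cache miss
        return None
```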
Stack from ghstack (oldest at bottom):
The important comment:
This is the dumbest, simplest thing I could manage to code in 1.5hrs. Here's how I tested it:
You can see on the second run it only compiles once, and cache hits.
There is a lot of extra polish needed, hopefully mostly all noted in TODOs. One big gap is how exactly to invalidate this cache. The config might want to be some sort of epoch number you can bump up to invalidate old caches. Won't polish until we agree this is a good approach.
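The epoch idea floated above could look like this minimal, hypothetical sketch (the function name, the epoch parameter, and the path layout are illustrative only): the cache path embeds a user-bumpable epoch, so bumping the config value orphans every old entry without any explicit deletion.

```python
import os

def pgo_cache_path(cache_dir, epoch, key_hash):
    # Old entries under a previous epoch directory are simply never
    # looked up again once the configured epoch is bumped.
    return os.path.join(cache_dir, "dynamo", f"epoch{epoch}", key_hash)
```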
Signed-off-by: Edward Z. Yang [email protected]
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec