Add support for ONNX-only and Caffe2 ONNX export #4120
thiagocrepaldi wants to merge 5 commits into facebookresearch:main
Conversation
Hi, any update on this merge request?
This PR will probably need to be split in two: one that fixes the PointRend model (and requires a new ::grid_sampler symbolic) and a second that fixes everything else. The former should take some time, but the latter should be fairly quick.
Hope this new feature will come soon! If I can help with anything, don't hesitate to contact me.
Hi, any update on this feature?
I am working on this today. If you know how to add new ONNX symbolics to PyTorch, you could try working on adding grid_sampler.
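For anyone picking this up, a custom symbolic roughly takes the following shape. This is a hedged sketch, not the eventual PyTorch implementation: it assumes the ONNX GridSample op (available since opset 16) as the target, and the integer-to-string mode/padding mappings below are my reading of the aten::grid_sampler arguments.

```python
# Sketch of a grid_sampler symbolic. aten::grid_sampler encodes its
# mode/padding_mode arguments as integers, while the ONNX GridSample op
# expects string attributes, so the symbolic mostly translates values
# and emits a single node.
GRID_SAMPLE_MODE = {0: "bilinear", 1: "nearest", 2: "bicubic"}
GRID_SAMPLE_PADDING = {0: "zeros", 1: "border", 2: "reflection"}

def grid_sampler_symbolic(g, input, grid, mode, padding_mode, align_corners):
    # `g` is the graph context torch.onnx passes to every symbolic; the
    # `_s`/`_i` suffixes attach string/int attributes to the emitted node.
    return g.op(
        "GridSample",
        input,
        grid,
        mode_s=GRID_SAMPLE_MODE[mode],
        padding_mode_s=GRID_SAMPLE_PADDING[padding_mode],
        align_corners_i=int(align_corners),
    )

# Registration would look roughly like (assuming opset 16):
# torch.onnx.register_custom_op_symbolic(
#     "aten::grid_sampler", grid_sampler_symbolic, 16
# )
```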
#4132 breaks ONNX export. Root cause is related to how …
pytorch/pytorch#76159 adds …

We need pytorch/pytorch#76498 to fix ONNX export after #4132 was merged. We also need pytorch/pytorch#76159 to add …. After both of these, this PR is ready to go.
@garymm somehow I cannot request another review from you through GH's UI, but I have addressed your comments.
Force-pushed 6087718 to 595eaef.
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

17 similar comments
The main root cause was that torch.nn.GroupNorm changed its behavior in newer PyTorch versions (pytorch/pytorch#74293); this PR fixes RetinaNetHead according to the new behavior. Also, add_export_config was not exposed when caffe2 was not compiled along with PyTorch, preventing users with legacy scripts from calling it. This PR exposes it and adds a warning message informing users about its deprecation in future versions. This PR also adds test_export_onnx.py with several unit tests for detectron2 models; ONNX is added as a dependency to the CI to allow the aforementioned tests to run. Lastly, because the detectron2 CI still uses old PyTorch versions (1.8, 1.9 and 1.10), this PR adds custom symbolic functions to fix bugs in these versions.
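The deprecation path for add_export_config described above can be sketched as follows. This is a minimal illustration; the exact warning message and placement in the PR may differ.

```python
import warnings

def add_export_config(cfg):
    # Sketch of the legacy entry point: keep it callable even when caffe2
    # is unavailable, but warn that it is deprecated and now a no-op.
    warnings.warn(
        "add_export_config is deprecated and behaves as a no-op. "
        "It will be removed in future versions of detectron2.",
        DeprecationWarning,
        stacklevel=2,
    )
    return cfg
```

Legacy scripts that call `cfg = add_export_config(cfg)` keep working; they just see a DeprecationWarning.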
Cheers @ppwwyyxx! Sorry for the trouble; my goal was to simplify the merge process by combining both approved PRs #4120/#4153, which were waiting to be merged. In the meantime, I found a couple of minor bugs that I decided to append so that we can provide an end-to-end ONNX export implementation that works out of the box. Instead of reverting, I've rebased the PR, splitting it into 5 commits with a detailed explanation in each of them. I kindly ask you to give reviewing them a shot, and if you still think I should split them again, I will do it. Happy Friday!
Definitely true, I'm trying to help @thiagocrepaldi as much as I can!
This commit fixes the Caffe2Tracer.export_caffe2 API for the latest PyTorch repo. Before this change, all tests at tests/test_export_caffe2.py failed because the latest onnx releases no longer have the onnx.optimizer namespace, and detectron2 code tried to use fuse_bn_into_conv from it. However, the fuse_bn_into_conv optimization previously present in onnx.optimizer is already performed by torch.onnx.export, therefore it was removed from detectron2 code. Depends on pytorch/pytorch#75718. Fixes facebookresearch#3488. Fixes pytorch/pytorch#69674 (PyTorch repo).
Although facebookresearch#4120 exported a valid ONNX graph, after running it with ONNX Runtime I realized that all model inputs were exported as constants. The root cause was that in detectron2/structures/image_list.py, the padding of batched images was done on a temporary (copy) variable, which prevents the ONNX tracer from tracking the user input, leading to the inputs being suppressed and stored as constants during export. Also, len(torch.Tensor) is traced as a constant shape, which poses a problem if the next input has a different shape. This PR fixes that by using torch.Tensor.shape[0] instead.
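The len() pitfall is easy to reproduce in isolation. A small sketch (the function names are illustrative, not detectron2 code):

```python
import torch

def batch_size_bad(x):
    # len(x) calls Tensor.__len__, which returns a plain Python int;
    # the tracer bakes it into the graph as a constant.
    return torch.zeros(len(x))

def batch_size_good(x):
    # x.shape[0] is recorded as a size op under tracing, so the
    # exported graph keeps the batch dimension dynamic.
    return torch.zeros(x.shape[0])

traced_bad = torch.jit.trace(batch_size_bad, torch.rand(3, 4))
traced_good = torch.jit.trace(batch_size_good, torch.rand(3, 4))

# With a larger batch, the len()-based trace still produces 3 elements,
# while the shape[0] version follows the new input.
out_bad = traced_bad(torch.rand(5, 4))    # torch.Size([3])
out_good = traced_good(torch.rand(5, 4))  # torch.Size([5])
```

The same constant-folding happens during torch.onnx.export, which is why the batch dimension was frozen before this fix.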
TracingAdapter creates extra outputs (through flatten_to_tuple) to store metadata needed to rebuild the original data format. This is unnecessary during ONNX export, as the original data will never be reconstructed to its original format using the Schema.__call__ API. This PR suppresses such extra output constants during torch.onnx.export() execution. Outside this API, the behavior is not changed, ensuring BC. Although not strictly necessary to achieve the same numerical results as PyTorch, when an ONNX model schema is compared to PyTorch's, the different number of outputs (the ONNX model will have more outputs than PyTorch) may not only confuse users, but also result in false negatives when coding model-comparison helpers.
PyTorch ONNX export has some limitations when exporting models with torch.jit.script. This PR suppresses `torch.jit.script_if_tracing` while torch.onnx.export is in execution. Outside this API context, the original behavior is maintained to ensure BC. This fixes ONNX export of detectron2.modeling.roi_heads.KRCNNConvDeconvUpsampleHead.
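The suppression mechanism can be sketched generically. This is a hedged, torch-free illustration: `suppress_script_if_tracing` and the `jit` stand-in namespace are hypothetical names, and in the real PR the patched attribute would be torch.jit.script_if_tracing.

```python
import contextlib
import types

def script_if_tracing(fn):
    """Stand-in for torch.jit.script_if_tracing (would script fn when tracing)."""
    fn.scripted = True  # pretend we compiled it
    return fn

# Stand-in for the torch.jit module namespace.
jit = types.SimpleNamespace(script_if_tracing=script_if_tracing)

@contextlib.contextmanager
def suppress_script_if_tracing(module):
    # Temporarily replace script_if_tracing with an identity decorator,
    # restoring the original afterwards to keep backward compatibility.
    original = module.script_if_tracing
    module.script_if_tracing = lambda fn: fn
    try:
        yield
    finally:
        module.script_if_tracing = original
```

During export, the exporter would wrap the torch.onnx.export call in this context so decorated functions stay plain Python and can be traced; everywhere else, scripting behaves as before.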
```python
try:
    norm_name = str(type(get_norm(norm, 1)))
    if "BN" in norm_name:
        logger.warning(
            f"Shared BatchNorm (type={norm_name}) may not work well in RetinaNetHead."
        )
except ValueError:
    # https://github.com/pytorch/pytorch/pull/74293
    if isinstance(norm, str):
        logger.warning(f"No BatchNorm was found for '{norm}' for RetinaNetHead.")
```
Please undo this change. There is no reason to ignore the exception.
The first commit cb755fe#diff-362c43f324b3aaa4d55eae710f24245e361fd9be77cb2ecaded84672ef88b0d5R356-R365 looks good, but there is a problematic change that I don't remember having reviewed. Added a comment.
The second commit 8f86528 looks OK, but like I said, caffe2 is deprecated, so there is a chance that it may not get accepted. That should be a decision separate from the ONNX-only export, so it's better as a separate PR.
The third commit b352172 appears to be fixing something, but I cannot see what problem it's trying to fix, so I cannot accept it.
If it is trying to fix a caffe2 export correctness issue, please put it in the caffe2-onnx PR; in that case it's less likely to be accepted, because it's more work for deprecated functionality.
If it's trying to fix an ONNX-only export correctness issue, please put it in this ONNX-only export PR and add a test to show what problems it's trying to fix.
Similarly, I don't understand what the 4th commit f108953 is trying to do, so I cannot accept it.
It seems to be about a user-experience improvement rather than a bug; in that case it should be a separate PR. Also, I expect to see a demonstration of the problem first, before jumping to solutions: there is a chance that the extra outputs are useful and should not be removed, but I cannot tell if I don't know what the problem is.
I cannot tell without running some tests myself whether the 5th commit 6fabe50 is correct or not. I strongly suspect it's incorrect, because the scripted function has a for-loop with dynamic length, which I assumed ONNX tracing cannot handle. But I don't know about "dynamic_axes" and might be wrong here. Anyway, that's a conversation better had separately, because this does not affect the basic (Fast/Mask R-CNN) models.
In summary, all of the unreviewed changes will need more iterations. Having them together will slow down the merge.
Summary of changes

This PR fixes both the ONNX-only and Caffe2 ONNX exporters for the latest versions of this repo and PyTorch.

For ONNX-only, the main issue is that `add_export_config(cfg)` is not exposed when Caffe2 is not compiled along with PyTorch, but for ONNX-only scenarios that dependency is not needed. Therefore, `add_export_config` is moved from `detectron2/export/api.py` to `detectron2/export/__init__.py`.

A second contribution is a new `test_export_onnx.py` test file that exports almost the same models as the `test_export_tracing.py` tests.

For Caffe2-ONNX, the main issue was a dependency on an ONNX optimizer pass that is deprecated in newer ONNX versions. This PR removes that dependency because the `fuse_bn_into_conv` optimization pass is already performed by `torch.onnx.export` anyway.

Fixes #3488
Fixes pytorch/pytorch#69674 (PyTorch repo)