
Add support for ONNX-only and Caffe2 ONNX export #4120

Closed
thiagocrepaldi wants to merge 5 commits into facebookresearch:main from thiagocrepaldi:main

Conversation

@thiagocrepaldi
Contributor

@thiagocrepaldi thiagocrepaldi commented Apr 1, 2022

Summary of changes

This PR fixes both the ONNX-only and Caffe2-ONNX exporters for the latest versions of this repo and PyTorch.

For ONNX-only export, the main issue is that add_export_config(cfg) is not exposed when Caffe2 is not compiled along with PyTorch, even though ONNX-only scenarios do not need that dependency. Therefore, add_export_config is moved from detectron2/export/api.py to detectron2/export/__init__.py
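The optional-dependency pattern behind this move can be sketched with the standard library. This is an illustration of the guarding technique only, not detectron2's actual `__init__.py`; the helper name is made up:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Check whether a top-level module can be imported, without importing it.

    Illustrative helper (not detectron2 code): Caffe2-dependent exporters
    are only exposed when Caffe2 is present, while add_export_config stays
    importable unconditionally.
    """
    return importlib.util.find_spec(name) is not None

# In an __init__.py, Caffe2-only symbols would sit behind the guard:
if module_available("caffe2"):
    pass  # e.g. import the Caffe2-dependent exporters here
# add_export_config would be imported here, outside the guard.
```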

A second contribution is a new test_export_onnx.py test file that exports almost the same set of models as the test_export_tracing.py tests.

For Caffe2-ONNX, the main issue was a dependency on an ONNX optimizer pass that is deprecated in newer ONNX versions.
This PR removes that dependency, because the fuse_bn_into_conv optimization pass is already performed by torch.onnx.export anyway.
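The folding that the exporter performs can be sketched as follows. This is a from-scratch illustration of the fuse_bn_into_conv math under standard BatchNorm semantics, not the torch.onnx.export implementation:

```python
import numpy as np

def fuse_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm that follows a conv into the conv's parameters.

    BN(conv(x)) == conv'(x), where the conv weights are scaled per output
    channel by gamma / sqrt(var + eps) and the bias is shifted accordingly.
    w has shape (C_out, C_in, kH, kW); the BN stats are per output channel.
    """
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale
    w_fused = w * scale[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused
```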

Fixes #3488
Fixes pytorch/pytorch#69674 (PyTorch repo)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 1, 2022
@thiagocrepaldi thiagocrepaldi changed the title Fix caffe2 export support Fix ONNX-only export support Apr 11, 2022
@thiagocrepaldi thiagocrepaldi changed the title Fix ONNX-only export support Add support for ONNX-only export Apr 12, 2022
@thiagocrepaldi thiagocrepaldi marked this pull request as ready for review April 12, 2022 19:45
@thiagocrepaldi thiagocrepaldi marked this pull request as draft April 13, 2022 00:47
@FrancescoMandru

Hi, any update on this merge request?

@thiagocrepaldi
Contributor Author

Hi, any update on this merge request?

This PR will probably need to be split in 2. One that fixes the PointRend model (and requires a new symbolic ::grid_sampler) and a second that fixes everything else. The former should take some time, but the second should be fairly quick

@FrancescoMandru

Hi, any update on this merge request?

This PR will probably need to be split in 2. One that fixes the PointRend model (and requires a new symbolic ::grid_sampler) and a second that fixes everything else. The former should take some time, but the second should be fairly quick

Hope this new feature comes soon! If I can help with anything, don't hesitate to contact me.

@FrancescoMandru

Hi, any update on this feature?

@thiagocrepaldi
Contributor Author

I am working on this today. If you know how to add new onnx symbolics to pytorch, you could try working on adding grid_sampler

@thiagocrepaldi
Contributor Author

#4132 breaks ONNX export. The root cause is related to how the to() symbolic is implemented, or how detectron2 calls to(device) in its code.

@thiagocrepaldi
Contributor Author

I am working on this today. If you know how to add new onnx symbolics to pytorch, you could try working on adding grid_sampler

pytorch/pytorch#76159 adds grid_sample to PyTorch, so we should have this working very soon

@thiagocrepaldi
Contributor Author

We need pytorch/pytorch#76498 to fix ONNX export after #4132 was merged

We also need pytorch/pytorch#76159 to add the ::grid_sampler symbolic

After both of these, this PR is ready to go
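For context, a missing symbolic like this can be supplied through the torch.onnx.register_custom_op_symbolic mechanism. The sketch below hardcodes the GridSample attributes for brevity; a complete grid_sampler symbolic must decode the mode/padding/align_corners arguments from the graph, and newer PyTorch versions ship this mapping natively:

```python
import torch
import torch.onnx

def grid_sampler_symbolic(g, input, grid, mode, padding_mode, align_corners):
    # Map aten::grid_sampler to the ONNX GridSample op (opset >= 16).
    # Attributes are hardcoded here for illustration only; a real symbolic
    # would read the constant mode/padding/align_corners inputs.
    return g.op("GridSample", input, grid,
                mode_s="bilinear", padding_mode_s="zeros",
                align_corners_i=0)

# "aten::grid_sampler" is the op that F.grid_sample lowers to in TorchScript.
torch.onnx.register_custom_op_symbolic(
    "aten::grid_sampler", grid_sampler_symbolic, opset_version=16
)
```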

@thiagocrepaldi
Contributor Author

@garymm somehow I cannot request another review from you through GH's UI, but I have addressed your comments

@thiagocrepaldi thiagocrepaldi marked this pull request as ready for review April 28, 2022 18:45

@garymm garymm left a comment


Thanks!

@thiagocrepaldi thiagocrepaldi force-pushed the main branch 4 times, most recently from 6087718 to 595eaef Compare April 28, 2022 20:39
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

17 similar comments
The main root cause was that torch.nn.GroupNorm changed its behavior in
newer PyTorch versions (pytorch/pytorch#74293).
This PR fixes the RetinaNetHead according to the new behavior.

Also, add_export_config was not exposed when Caffe2
was not compiled along with PyTorch, preventing users with legacy
scripts from calling it. This PR exposes it and adds a warning message
informing users about its deprecation in future versions.

This PR also adds test_export_onnx.py with several unit tests for
detectron2 models.

ONNX is added as a dependency to the CI to allow the
aforementioned tests to run.

Lastly, because detectron2 CI still uses old PyTorch versions (1.8, 1.9
and 1.10), this PR adds custom symbolic functions to fix bugs in these
versions.
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@thiagocrepaldi
Contributor Author

Now I found this PR has many unreviewed changes and frankly I can see some of them are wrong. Could you move back to a state where it only contains reviewed changes? IIRC I only reviewed two PRs (for ONNX export and caffe2 export). FYI, @FrancescoMandru is not a developer of the project and just having their approval is not enough.

I fully understand that the delay in merging is not a good experience. But adding more changes to make a super huge PR is not helping. Let's just wait for maintainers when they have a chance.

Cheers @ppwwyyxx! Sorry for the trouble; my goal was to simplify the merge process by combining both approved PRs #4120/#4153, which were waiting to be merged. In the meantime, I found a couple of minor bugs that I decided to append so that we can provide an end-to-end ONNX export implementation that works out of the box.

Instead of reverting it, I've just rebased the PR, splitting it into 5 commits with a detailed explanation in each.
In summary, the first 2 are the same changes you approved before (for ONNX and Caffe2). The next three fix the ONNX export, remove unnecessary outputs from the ONNX graph (and the ONNX graph only), and force tracing of heatmaps_to_keypoints(), which is OK because ONNX uses the dynamic_axes argument to ensure the graph will accept dynamic inputs.

I kindly ask that you give reviewing them a shot, and if you still think I should split them again, I will do it.

Happy Friday!

@FrancescoMandru

@FrancescoMandru is not a developer of the project and just having their approval is not enough.

Definitely true, I'm trying to help @thiagocrepaldi as much as I can!

Thiago Crepaldi added 4 commits May 27, 2022 10:16
This commit fixes the Caffe2Tracer.export_caffe2 API for the latest PyTorch repo.

Before this change, all tests at tests/test_export_caffe2.py fail because
the latest onnx releases no longer have the onnx.optimizer namespace,
while detectron2 code tried to use fuse_bn_into_conv from it.

However, the fuse_bn_into_conv optimization previously present in onnx.optimizer
is already performed by torch.onnx.export. Therefore it was removed
from detectron2 code.

Depends on pytorch/pytorch#75718
Fixes facebookresearch#3488
Fixes pytorch/pytorch#69674 (PyTorch repo)
Although facebookresearch#4120 exported a valid ONNX graph, after running it with ONNX
Runtime, I realized that all model inputs were exported as constants.

The root cause was that at detectron2/structures/image_list.py, the
padding of batched images was done on a temporary (copy) variable,
so the ONNX tracer could not track the user input, leading it to
suppress the inputs and store them as constants during export.

Also, len(torch.Tensor) is traced as a constant shape, which poses a
problem if the next input has a different shape.
This PR fixes that by using torch.Tensor.shape[0] instead
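The len() pitfall can be reproduced in isolation with torch.jit.trace; this is a minimal sketch of the tracing behavior, not the image_list.py code:

```python
import torch

def with_len(x):
    # len(x) becomes a plain Python int during tracing, so the output
    # shape is baked into the graph as a constant.
    return torch.zeros(len(x))

def with_shape(x):
    # x.shape[0] is recorded as a size query on the input, so it stays
    # dynamic in the traced graph.
    return torch.zeros(x.shape[0])

example = torch.randn(3, 4)
traced_len = torch.jit.trace(with_len, example)
traced_shape = torch.jit.trace(with_shape, example)

# Calling the traced functions with a differently sized input shows the
# difference: the len() version keeps the original constant shape.
bigger = torch.randn(5, 4)
```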
TracingAdapter creates extra outputs (through flatten_to_tuple) to store
metadata needed to rebuild the original data format.

This is unnecessary during ONNX export, as the original data will never
be reconstructed to its original format through the Schema.__call__ API.
This PR suppresses such extra output constants during
torch.onnx.export() execution. Outside this API, the behavior is not
changed, ensuring BC.
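The flatten-plus-schema idea behind TracingAdapter can be sketched generically. This is a deliberately simplified illustration; detectron2's actual flatten_to_tuple and Schema handle many more container types (Instances, Boxes, etc.):

```python
def flatten_to_tuple(obj):
    """Flatten nested dicts/lists/tuples into a flat tuple of leaves plus a
    schema describing how to rebuild the original structure."""
    if isinstance(obj, dict):
        keys = sorted(obj)
        parts = [flatten_to_tuple(obj[k]) for k in keys]
        flat = tuple(x for f, _ in parts for x in f)
        return flat, ("dict", [(k, s) for k, (_, s) in zip(keys, parts)])
    if isinstance(obj, (list, tuple)):
        parts = [flatten_to_tuple(item) for item in obj]
        flat = tuple(x for f, _ in parts for x in f)
        return flat, ("seq", type(obj), [s for _, s in parts])
    return (obj,), "leaf"

def rebuild(flat, schema):
    """Inverse of flatten_to_tuple: consume the flat leaves per the schema."""
    it = iter(flat)
    def go(s):
        if s == "leaf":
            return next(it)
        if s[0] == "dict":
            return {k: go(sub) for k, sub in s[1]}
        _, typ, subs = s
        return typ(go(sub) for sub in subs)
    return go(schema)
```

The schema is the "metadata" that would otherwise become extra constant outputs in the exported graph; since an ONNX consumer never calls rebuild, those outputs can be dropped during export.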

Although not strictly necessary to achieve the same numerical results as
PyTorch, when an ONNX model schema is compared to PyTorch's, the different
number of outputs (the ONNX model will have more outputs than PyTorch) may
not only confuse users, but also result in false negatives when coding
model comparison helpers.
PyTorch ONNX export has some limitations when exporting models that use
torch.jit.script. This PR suppresses `torch.jit.script_if_tracing`
while torch.onnx.export is in execution. Outside this API context, the
original behavior is maintained to ensure BC.

This fixes ONNX export of
detectron2.modeling.roi_heads.KRCNNConvDeconvUpsampleHead
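One way to realize the suppression described here is to patch the decorator for the duration of the export. This is a hedged sketch of the idea only; detectron2's actual patch point and mechanism may differ:

```python
import contextlib
from unittest import mock

import torch

@contextlib.contextmanager
def disable_script_if_tracing():
    """Temporarily make torch.jit.script_if_tracing a no-op, so functions
    decorated inside this context are traced rather than scripted during
    torch.onnx.export.

    Caveat: this only affects decorators applied while the context is
    active; functions already wrapped at import time are unaffected, which
    is why a real implementation needs a more careful patch point.
    """
    with mock.patch.object(torch.jit, "script_if_tracing", lambda fn: fn):
        yield
```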
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

Comment on lines +356 to +365
try:
    norm_name = str(type(get_norm(norm, 1)))
    if "BN" in norm_name:
        logger.warning(
            f"Shared BatchNorm (type={norm_name}) may not work well in RetinaNetHead."
        )
except ValueError:
    # https://github.com/pytorch/pytorch/pull/74293
    if isinstance(norm, str):
        logger.warning(f"No BatchNorm was found for '{norm}' for RetinaNetHead.")
Contributor


Please undo this change. There is no reason to ignore the exception.

Contributor

@ppwwyyxx ppwwyyxx left a comment


The first commit cb755fe#diff-362c43f324b3aaa4d55eae710f24245e361fd9be77cb2ecaded84672ef88b0d5R356-R365 looks good, but there is a problematic change that I don't remember having reviewed. Added a comment.

The second commit 8f86528 looks OK, but like I said, caffe2 is deprecated, so there is a chance that it may not get accepted. That should be a decision separate from the ONNX-only export, so it's better as a separate PR.

The third commit b352172 appears to be fixing something, but I cannot see what problem it's trying to fix, so I cannot accept it.
If it is trying to fix a caffe2 export correctness issue, please put it in the caffe2-onnx PR; in that case it's less likely to be accepted, because it's more work for a deprecated functionality.
If it's trying to fix an ONNX-only export correctness issue, please put it in this ONNX-only export PR and add a test to show what problems it's trying to fix.

Similarly, I don't understand what the 4th commit f108953 is trying to do, so I cannot accept it.
It seems to be about a user experience improvement, not a bug; in that case it should be a separate PR. Also, I expect to see a demonstration of the problem first, before jumping to solutions: there is a chance that the extra outputs are useful and should not be removed, but I cannot tell if I don't know what the problem is.

I cannot tell without running some tests myself whether the 5th commit 6fabe50 is correct or not. I strongly suspect it's incorrect, because the scripted function has a for loop with dynamic length, which I assumed ONNX tracing cannot handle. But I don't know about "dynamic_axes" and might be wrong here. Anyway, that's a conversation better had separately as well, because this does not affect the basic (fast/mask r-cnn) models.

In summary, all of the unreviewed changes will need more iterations. Having them together will slow down the merge.

@thiagocrepaldi
Contributor Author

Divided into 3 different PRs:

#4291
#4295
#4315


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to export PointRend model (from detectron2) to ONNX
Exporting a model to ONNX; onnx.optimizer

5 participants