
Adding maskrcnn model from torchvision #547

Closed

xuzhao9 wants to merge 38 commits into main from xz9/add-maskrcnn-cudagraph

Conversation

@xuzhao9
Contributor

@xuzhao9 xuzhao9 commented Nov 8, 2021

Add maskrcnn model from torchvision.
This is the first step towards adding dynamic shapes support for maskrcnn.

@xuzhao9 xuzhao9 changed the title [WIP] Adding maskrcnn model from torchvision Adding maskrcnn model from torchvision Nov 9, 2021
@xuzhao9 xuzhao9 requested a review from Krovatkin November 9, 2021 18:35
@xuzhao9
Contributor Author

xuzhao9 commented Nov 9, 2021

The eval GPU utilization looks terrible:

[profiler screenshot]

There is a big gap in train as well:

[profiler screenshot]

Looking into it.

@xuzhao9
Contributor Author

xuzhao9 commented Nov 11, 2021

After changing to the COCO 2017 data, the utilization for evaluation improves but still does not look great:

[profiler screenshot]

Train utilization looks much better, but there is still a gap at the beginning (possibly caused by data copying):

[profiler screenshot]

@xuzhao9
Contributor Author

xuzhao9 commented Nov 11, 2021

After adding data prefetch, the gap at the front is gone, but the gap at the end still exists for eval:

[profiler screenshot]

At the same time, train profiling looks terrific:

[profiler screenshot]
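The prefetch approach described above can be sketched roughly as follows. This is a hypothetical `Prefetcher` wrapper, not the PR's actual implementation: the idea is to stage the next batch's host-to-device copy on a side CUDA stream while the current batch is being computed, falling back to plain iteration on CPU.

```python
import torch

class Prefetcher:
    """Overlap host-to-device copies with compute via a side CUDA stream.

    Sketch only; the PR's actual prefetch code may differ. Falls back to
    plain iteration when no GPU is available.
    """

    def __init__(self, loader, device):
        self.loader = loader
        self.device = torch.device(device)
        self.stream = torch.cuda.Stream() if self.device.type == "cuda" else None

    def __iter__(self):
        if self.stream is None:
            for batch in self.loader:
                yield batch.to(self.device)
            return
        staged = None
        for batch in self.loader:
            # Issue the copy for the *next* batch on the side stream.
            with torch.cuda.stream(self.stream):
                nxt = batch.to(self.device, non_blocking=True)
            if staged is not None:
                yield staged
            # Make the compute stream wait for the staged copy to finish.
            torch.cuda.current_stream().wait_stream(self.stream)
            staged = nxt
        if staged is not None:
            yield staged
```

For non-blocking copies to actually overlap, the host-side batches should live in pinned memory (e.g. `pin_memory=True` on the DataLoader).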

@xuzhao9 xuzhao9 changed the title Adding maskrcnn model from torchvision [WIP] Adding maskrcnn model from torchvision Nov 11, 2021
Contributor

@Krovatkin Krovatkin left a comment


A few quick questions to understand this PR a bit better.

lr = 0.02
momentum = 0.9
weight_decay = 1e-4
params = [p for p in self.model.parameters() if p.requires_grad]
Contributor


Out of curiosity, is this some kind of an optimization? Wouldn't autograd ignore parameters that don't require gradients?

Contributor Author

@xuzhao9 xuzhao9 Nov 11, 2021


@Krovatkin I borrowed these default values from the torchvision reference code: https://github.com/pytorch/vision/blob/main/references/detection/train.py#L77. These parameters seem to be the defaults for torchvision maskrcnn training.
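For reference, the setup under discussion amounts to something like the following minimal sketch. The two-layer model here is purely illustrative (the PR uses torchvision's Mask R-CNN); the hyperparameters are the ones quoted above.

```python
import torch

# Illustrative stand-in model; the PR uses torchvision's maskrcnn_resnet50_fpn.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))
model[0].weight.requires_grad_(False)  # e.g. a frozen backbone parameter

# Defaults borrowed from torchvision's references/detection/train.py.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.02, momentum=0.9, weight_decay=1e-4)
```

The filter mainly keeps frozen parameters out of the optimizer's param group; as the follow-up discussion notes, `lr` itself still has to be passed explicitly.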

Member


I think we don't need this anymore, including in torchvision.

Contributor Author

@xuzhao9 xuzhao9 Nov 16, 2021


Test failed when removing these parameters (https://app.circleci.com/pipelines/github/pytorch/benchmark/2710/workflows/26c92ded-69b3-4cdc-8c35-3cc4ab5f8e13/jobs/2793): "ValueError: parameter group didn't specify a value of required optimization parameter lr", so they are still needed.

@xuzhao9 xuzhao9 requested a review from Krovatkin November 11, 2021 16:46
@fmassa
Member

fmassa commented Nov 11, 2021

Hi @xuzhao9

I believe the slowdown during inference of Mask R-CNN is due to this function.

Indeed, we perform many small operations in a for loop, which is not ideal for maximizing GPU utilization. In particular, I believe this line is the main culprit.

I think we can further optimize this function at the Python level by leveraging a single call to grid_sample, but it would require rewriting the function. I believe detectron2 contains a more optimized version of this function here.
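The "single call to grid_sample" idea can be illustrated in miniature (hypothetical shapes, not the torchvision function itself): a batched call produces the same result as a per-instance Python loop while launching far fewer kernels.

```python
import torch
import torch.nn.functional as F

n = 8  # number of predicted instance masks (hypothetical)
masks = torch.rand(n, 1, 28, 28)

# Identity transform per instance, just to have a sampling grid; the real
# code would build one grid per box from its paste coordinates.
theta = torch.eye(2, 3).repeat(n, 1, 1)
grid = F.affine_grid(theta, size=(n, 1, 56, 56), align_corners=False)

# One kernel launch for all instances...
batched = F.grid_sample(masks, grid, align_corners=False)

# ...versus n tiny launches in a Python loop.
looped = torch.cat(
    [F.grid_sample(masks[i : i + 1], grid[i : i + 1], align_corners=False)
     for i in range(n)]
)
```

Both paths compute the same values; the difference is purely in dispatch overhead, which is what dominates the trace here.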

Also, ccing @datumbox as this could be a good task to optimize the torchvision function

@eircfb
Contributor

eircfb commented Nov 11, 2021

> After adding data prefetch, the gap in the front is gone but the gap at the end still exists for eval:
>
> [profiler screenshot]
>
> At the same time, train profiling looks terrific:
>
> [profiler screenshot]

Question for model authors:

This still doesn't look great. Is this how the model is used in nature?

What can be done here to improve utilization? Is this from a lack of prefetch for the second model? It's no good to have that idle section at the end.

[edit: the whole inference looks right on the edge of being underutilized; do we need a different way to run this model?]

@eircfb
Contributor

eircfb commented Nov 11, 2021

> Hi @xuzhao9
>
> I believe the slowdown during inference of mask r-cnn is due to this function
>
> Indeed, we perform many small operations in a for loop, which is not ideal for maximizing GPU utilization. In particular, I believe this line is the main culprit.
>
> I think we can further optimize this function at the Python-level by leveraging a single call to grid_sample, but it would require rewriting the function. I believe detectron2 contains a more optimized version of this function in here
>
> Also, ccing @datumbox as this could be a good task to optimize the torchvision function

Right, launching all these tiny kernels against independent data. That's a perf error; it causes dispatch overhead to dominate. It can maybe be patched up by using CUDA graphs, but it's really not good.

Is this really how the model runs in nature?
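The CUDA graphs idea mentioned here can be sketched as follows. This is a hypothetical `make_graphed` helper, not code from the PR: CUDA graphs require static shapes and fixed tensor addresses, which Mask R-CNN's data-dependent shapes make awkward. On a CPU-only machine it simply falls back to eager execution.

```python
import torch

def make_graphed(fn, static_input):
    """Capture fn once into a CUDA graph, then replay it with a single
    dispatch instead of many tiny kernel launches. Sketch only.
    """
    if not torch.cuda.is_available():
        return fn  # eager fallback on CPU

    # Warm up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            fn(static_input)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = fn(static_input)

    def replay(x):
        static_input.copy_(x)  # graphs read from the captured addresses
        graph.replay()
        return static_output

    return replay
```

Replaying a graph amortizes per-kernel launch cost, but it only helps when the captured workload's shapes never change between calls.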

@fmassa
Member

fmassa commented Nov 11, 2021

@eircfb I believe I've answered your questions in my answer just before; let me know if you have further questions. It's the post-processing steps that are the culprit here, and they can definitely be optimized (without having to resort to custom CUDA kernels).

This is normally how we use the model, yes.

@eircfb
Contributor

eircfb commented Nov 11, 2021

> @eircfb I believe I've answered your questions in my answer just before, let me know if you have further questions. It's the post-processing steps that are the culprit in here, and they can definitely be optimized (without having to resort to custom CUDA kernels)
>
> This is normally how we use the model, yes.

Right, we posted at the same time. Isn't that loop of tiny dispatches just for the first section? What causes the low utilization towards the end of the trace?

@fmassa
Member

fmassa commented Nov 12, 2021

@eircfb I believe the section I pointed out should correspond to the last part of the trace (although to be 100% sure I would need to check the full trace with further info, @xuzhao9 can you send it to me?)

For the first section with slowdown, I would need to check more carefully but my first guess would be that it comes from this section https://github.com/pytorch/vision/blob/0fa747eea4569f349fb0dacab44604b7b11c1a74/torchvision/models/detection/rpn.py#L253-L274

Maybe the slowdown could be attributed to this function being called instead of this one, but again, I would need to see the full trace to be sure

@xuzhao9
Contributor Author

xuzhao9 commented Nov 12, 2021

> @eircfb I believe the section I pointed out should correspond to the last part of the trace (although to be 100% sure I would need to check the full trace with further info, @xuzhao9 can you send it to me?)
>
> For the first section with slowdown, I would need to check more carefully but my first guess would be that it comes from this section https://github.com/pytorch/vision/blob/0fa747eea4569f349fb0dacab44604b7b11c1a74/torchvision/models/detection/rpn.py#L253-L274
>
> Maybe the slowdown could be attributed to this function being called instead of this one, but again, I would need to see the full trace to be sure

Here is the full trace of eval (bs==4, with prefetch): pytorch-benchmark-dev-0_1709.1636590532626.pt.trace.json.zip

@xuzhao9 xuzhao9 force-pushed the xz9/add-maskrcnn-cudagraph branch from 407ab50 to deadddc on November 15, 2021 19:12
@xuzhao9 xuzhao9 changed the title [WIP] Adding maskrcnn model from torchvision Adding maskrcnn model from torchvision Nov 15, 2021
@facebook-github-bot
Contributor

@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xuzhao9 xuzhao9 requested a review from fmassa November 16, 2021 01:25
@fmassa
Member

fmassa commented Nov 16, 2021

@xuzhao9 after looking at the full trace, my comments from before seem to be accurate.

The first slow-down is due to this part of the implementation. I have a gut feeling that this might be due to bad hyperparameters here, so if you could try always dispatching to this codepath, it would give hints as to whether we can easily fix this.

For the second (and biggest) slowdown, the problem is what I mentioned before about the mask pasting into images, and it would require re-writing the function in a better way.

@xuzhao9
Contributor Author

xuzhao9 commented Nov 16, 2021

> @xuzhao9 after looking at the full trace, my comments from before seems to be accurate.
>
> The first slow-down is due to this part of the implementation. I have a gut-feeling that this might be due to bad hyperparameters in here, so if you could try always dispatching to this codepath it would give hints if we can easily fix this.
>
> For the second (and biggest slowdown), the problem is what I mentioned before on the mask pasting in images, and would require re-writing the function in a better way

Thank you @fmassa! I think your explanations make sense (though I am not an expert). I think the code is in decent shape now and we are confident it represents common use cases. We can use it as a baseline to help further improve the model quality.

Please help review and stamp it so that we can proceed with developing dynamic shapes.

Thanks!

Member

@fmassa fmassa left a comment


Thanks!

I've made a few comments that could simplify the implementation (by removing unused functions), and also to mention that both TorchScript and CPU are supported.

Otherwise the rest LGTM, so I'm approving this; it can be merged as-is.

@xuzhao9 xuzhao9 force-pushed the xz9/add-maskrcnn-cudagraph branch from 9d43a62 to 69d3dcb on November 17, 2021 21:02
@facebook-github-bot
Contributor

@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Comment on lines +89 to +90
if self.device != "cuda":
    raise NotImplementedError("CPU is not supported by this model")
Member


For my understanding, did you face any errors when running the code on the CPU?

Contributor Author

@xuzhao9 xuzhao9 Nov 18, 2021


Member


Note that during scripting, the model always returns (Losses, Detections) (as pointed out in the error message).

So you would need to change the training code to always unpack losses, detections = model(inputs), and go from there.
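The pattern described here, a scripted model whose forward always returns a (losses, detections) tuple, looks like this in miniature. The toy module below is a hypothetical stand-in for the torchvision detection model, just to show the unpacking.

```python
from typing import Dict, List, Tuple

import torch

class ToyDetector(torch.nn.Module):
    """Toy stand-in: like torchvision detection models under
    torch.jit.script, forward always returns (losses, detections)."""

    def forward(
        self, images: torch.Tensor
    ) -> Tuple[Dict[str, torch.Tensor], List[torch.Tensor]]:
        losses = {"loss_total": images.sum()}
        detections = [images]
        return losses, detections

scripted = torch.jit.script(ToyDetector())
# Training code must unpack both, even if it only uses the losses:
losses, detections = scripted(torch.ones(2, 3))
```

In eager mode torchvision returns losses in training and detections in eval, but TorchScript requires a single return type, hence the fixed tuple.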

@xuzhao9 xuzhao9 force-pushed the xz9/add-maskrcnn-cudagraph branch from 89ef243 to 15fb801 on November 22, 2021 18:24
@facebook-github-bot
Contributor

@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@xuzhao9 merged this pull request in 39b5b6d.

@xuzhao9 xuzhao9 deleted the xz9/add-maskrcnn-cudagraph branch December 10, 2021 00:42
@jdsgomes

> I've made a few comments that could simplify the implementation (by removing unused functions), and also to mention that both torchscript as well as CPU are supported.

I re-ran the profiling with the code changes from this PR, and indeed it improves the overall figure (visible in the middle part, where we are running nms).

After changes:
[profiler screenshot]

Before changes:
[profiler screenshot]
