Add SSDlite architecture with MobileNetV3 backbones#3757
datumbox merged 27 commits into pytorch:master
Conversation
fmassa
left a comment
Looks great, thanks a lot Vasilis!
I have a couple of comments, let me know what you think
```python
# Enable [-1, 1] rescaling and reduced tail if no pretrained backbone is selected
rescaling = reduce_tail = not pretrained_backbone
```
This is a bit confusing, but I assume the [-1, 1] rescaling is necessary to get best results given the current settings?
That is correct. Rescaling was part of the changes needed to boost the accuracy by 1mAP.
```python
backbone = _mobilenet_extractor("mobilenet_v3_large", progress, pretrained_backbone, trainable_backbone_layers,
                                norm_layer, rescaling, _reduced_tail=reduce_tail, _width_mult=1.0)
```
```python
size = (320, 320)
```
This means that the size is hard-coded, and even if the user passes a different `size` via `**kwargs` in the constructor it won't be used?
What about doing something like `size = kwargs.get("size", (320, 320))` instead, so that users can customize the input size if they wish?
I chose to hardcode it because this is the ssdlite320 model, which uses a fixed 320x320 input size. The input size is much less flexible on SSD models compared to FasterRCNN, because they make a few strong assumptions about the input.
If someone wants to use a different size, it would be simpler to create the backbone, configure the DefaultBoxGenerator and then initialize the SSD directly with the config of their choice. Overall I felt that this approach would be simpler than trying to offer an API that covers all user needs.
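The override pattern suggested above can be sketched in plain Python. The function below is an illustrative stand-in for the constructor under discussion, not the actual torchvision implementation; it only shows how a `size` kwarg could default to 320x320 while remaining overridable:

```python
def ssdlite320_config(**kwargs):
    # Illustrative stand-in: keep 320x320 as the default input size,
    # but let callers override it via kwargs if they wish.
    size = kwargs.pop("size", (320, 320))
    return {"size": size, **kwargs}

print(ssdlite320_config()["size"])                  # (320, 320)
print(ssdlite320_config(size=(512, 512))["size"])   # (512, 512)
```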
```python
kwargs = {**defaults, **kwargs}
model = SSD(backbone, anchor_generator, size, num_classes,
            head=SSDLiteHead(out_channels, num_anchors, num_classes, norm_layer),
            image_mean=[0., 0., 0.], image_std=[1., 1., 1.], **kwargs)
```
Hum, interesting.
I would have expected that we could remove the rescaling part and instead change the mean/std here to `image_mean=0.5, image_std=0.5`, but I assume this wasn't done because padded regions would end up with a different value than you would have liked, is that correct?
Also, this probably means that even a pretrained backbone wouldn't give good results, because you are passing a non-default image mean/std.
In this case, might it be better to disable passing a pretrained backbone altogether?
Correct. Here I'm trying to stay as close to the canonical implementation as possible, which helped me close the gap in accuracy.
You are right that a pretrained backbone would need a different mean/std. Thankfully, because our setup trains end-to-end and uses extensive BN, the backbone adapts to the different input fairly quickly even when one uses pre-trained weights. In the end, since I trained for quite a few epochs, starting from random weights led to a better result (a common finding in similar setups).
Though it might indeed be simpler for the API to disable passing a pretrained backbone, that would make SSDlite's API different from every other model's. It would also create issues with our training scripts, which expect to be able to pass this parameter. To better address this remark, I will make the mean/std configurable.
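The equivalence discussed in this thread can be checked numerically. The `normalize` helper below is a hand-rolled sketch of per-channel input normalization, not the actual torchvision transform:

```python
def normalize(x, mean, std):
    # Sketch of per-channel normalization applied to pixels in [0, 1].
    return (x - mean) / std

# image_mean=0.5, image_std=0.5 would map [0, 1] pixels onto [-1, 1]:
assert normalize(0.0, 0.5, 0.5) == -1.0
assert normalize(1.0, 0.5, 0.5) == 1.0

# image_mean=0., image_std=1. (as passed in this PR) is a no-op,
# leaving the [-1, 1] rescaling to happen inside the backbone instead:
assert normalize(0.25, 0.0, 1.0) == 0.25
```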
fmassa
left a comment
Thanks for the answers Vasilis!
We chatted offline; let's get this PR merged and then follow up by creating a few issues to investigate some of the points I brought up.
Summary:
* Partial implementation of SSDlite.
* Add normal init and BN hyperparams.
* Refactor to keep JIT happy.
* Completed SSDlite.
* Fix lint.
* Update todos.
* Add expected file in repo.
* Use C4 expansion instead of C4 output.
* Change scales formula for Default Boxes.
* Add cosine annealing on trainer.
* Make T_max count epochs.
* Fix test and handle corner-case.
* Add support of width_mult.
* Add ssdlite presets.
* Change ReLU6, [-1, 1] rescaling, backbone init & no pretraining.
* Use _reduced_tail=True.
* Add sync BN support.
* Adding the best config along with its weights and documentation.
* Make mean/std configurable.
* Fix "not implemented for half" exception.

Reviewed By: cpuhrsch
Differential Revision: D28538769
fbshipit-source-id: df6c2e79b76e6d6297aa51ca0ff4535dc59eaf9b
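One of the summary items changes the scales formula for the default boxes. As a point of reference, the SSD paper spaces the per-feature-map scales linearly between a minimum and maximum ratio; the sketch below illustrates that formula, with `scale_min`/`scale_max` defaults that are illustrative and may differ from the values this PR ends up using:

```python
def ssd_scales(num_feature_maps, scale_min=0.2, scale_max=0.95):
    # Linearly spaced default-box scales, one per feature map,
    # following the formula from the SSD paper (defaults illustrative).
    if num_feature_maps == 1:
        return [scale_min]
    step = (scale_max - scale_min) / (num_feature_maps - 1)
    return [scale_min + step * k for k in range(num_feature_maps)]

print([round(s, 2) for s in ssd_scales(6)])  # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
```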
```python
get_depth = lambda d: max(min_depth, int(d * width_mult))  # noqa: E731
extra = nn.ModuleList([
```
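The `get_depth` lambda in the snippet above is the standard MobileNet-style width-multiplier clamp: scale a channel count by `width_mult`, but never drop below a minimum depth. A self-contained sketch (the parameter defaults here are illustrative, not necessarily the PR's values):

```python
def get_depth(d, width_mult=1.0, min_depth=16):
    # Scale the channel count by the width multiplier,
    # clamping to min_depth so narrow models stay functional.
    return max(min_depth, int(d * width_mult))

print(get_depth(512, width_mult=0.5))  # 256
print(get_depth(24, width_mult=0.5))   # 16 (clamped to min_depth)
```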
@datumbox could you please help me figure this out? I cannot find any info about these extra layers in the papers. Where did you get them from?
I'm trying to create a modification of this model and am struggling to understand it; any help would be appreciated!
I want to reduce the number of encoder layers so the feature maps can detect small objects.
@datumbox Thank you for the quick reply! It's very helpful
@datumbox in section 6.3 of the MobileNetV3 paper, I only see info on connecting the C4 and C5 layers to the SSD head. There is nothing about these extra layers there.
Have you checked the reference code I sent? This comes from their official repo.
Yes, I see that in the TensorFlow implementation.
What I'm trying to understand is: if I reduce the depth at C4 (and thus the output stride, to target very small objects), how should I change the rest of the layers?
Sorry, it's been quite some time since I wrote the implementation. I think you will need to dig into the original research repo to get the details.
Resolves #1422, fixes #3757
This PR implements SSDlite with MobileNetV3 backbone as outlined in the papers [1] and [2].
Trained using the code committed at 8aa3f58. The current best pre-trained model was trained with (using the latest git hash):
Submitted batch job 40959060, 41037042, 41046786
Accuracy metrics at 4ca472e:
Validated with:
Speed benchmark:
0.09 sec per image on CPU