[1] enables varying the input/output size in order to perform multiscale, multiview image processing, which bolsters classification confidence and supports localisation and object detection. I wonder whether and how this could be implemented in Caffe?
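For context, the reason the output size has to vary with the input is just the usual convolution/pooling size arithmetic. A rough standalone sketch (no Caffe dependencies; the layer parameters and input widths below are illustrative, not taken from the paper):

```cpp
// Standalone sketch of how output spatial size follows input size through
// conv/pool stages, which is what multiscale inference relies on.
#include <cstdio>

int out_size(int in, int kernel, int stride, int pad) {
  return (in + 2 * pad - kernel) / stride + 1;
}

int main() {
  const int scales[] = {231, 461};         // two example input widths
  for (int in : scales) {
    int s = out_size(in, 11, 4, 0);        // e.g. an 11x11/4 first conv
    s = out_size(s, 2, 2, 0);              // followed by 2x2/2 pooling
    std::printf("input %d -> feature map %d\n", in, s);
  }
  return 0;
}
```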
One possibility would be to set blob sizes to their maximum expected values and then account for the actual input size during computation at each layer. I am not familiar enough with the Caffe sources to predict the overhead this approach would incur; I imagine it could lead to redundant memory copies and involved index arithmetic to access the right data.
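To make that concrete, here is a minimal standalone sketch (not actual Caffe code; `MaxSizedBlob` and `forward_copy` are hypothetical names chosen for illustration) of what "allocate at the maximum, compute on the actual size" would mean at a single layer, and the row-stride bookkeeping it forces on every access:

```cpp
// Minimal sketch: the buffer is sized for the largest expected input, and
// each operation only touches the active sub-region of it.
#include <vector>
#include <cstdio>

struct MaxSizedBlob {
  int max_h, max_w;        // allocated (maximum) spatial extent
  int h, w;                // actual extent of the current input
  std::vector<float> data;
  MaxSizedBlob(int mh, int mw)
      : max_h(mh), max_w(mw), h(mh), w(mw), data(mh * mw, 0.f) {}
  // Index arithmetic must use the *allocated* width as the row stride,
  // which is the bookkeeping every layer would need to get right.
  float& at(int y, int x) { return data[y * max_w + x]; }
};

// Example "layer": copies only the active h x w region from bottom to top.
void forward_copy(const MaxSizedBlob& bottom, MaxSizedBlob& top) {
  top.h = bottom.h;
  top.w = bottom.w;
  for (int y = 0; y < bottom.h; ++y)
    for (int x = 0; x < bottom.w; ++x)
      top.at(y, x) = bottom.data[y * bottom.max_w + x];
}

int main() {
  MaxSizedBlob bottom(512, 512), top(512, 512);
  bottom.h = 224; bottom.w = 224;   // a smaller "actual" input on this pass
  forward_copy(bottom, top);
  std::printf("active region: %d x %d of %d x %d allocated\n",
              top.h, top.w, top.max_h, top.max_w);
  return 0;
}
```

Since the active region is no longer contiguous in memory, GPU kernels would either need the same stride handling or a repacking step, which is presumably where the redundant copying I mentioned would come from.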
What other possibilities are there? I would be happy to put together a PR should we work out a decent solution.
[1] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. arXiv:1312.6229 [cs.CV].